DOCUMENT RESUME

ED 196 935                                        TM 810 050

AUTHOR        Anderson, Beverly L.; And Others
TITLE         Educational Testing Facts and Issues: A Layperson's
              Guide to Testing in the Schools.
INSTITUTION   California State Dept. of Education, Sacramento.
              Office of Program Evaluation and Research.; Nero and
              Associates, Inc., Portland, Oreg.; Northwest Regional
              Educational Lab., Portland, Oreg.
SPONS AGENCY  National Inst. of Education (ED), Washington, D.C.
PUB DATE      Sep 80
CONTRACT      400-79-0059
NOTE          56p.; For related documents, see TM 810 047-049.

EDRS PRICE    MF01/PC03 Plus Postage.
DESCRIPTORS   *Educational Practices; *Educational Testing;
              Elementary Secondary Education; Lay People; Public
              Schools; *Testing Problems; *Test Interpretation;
              *Test Use

ABSTRACT

This booklet addresses the role of testing in today's
public education system, and presents a series of questions and
answers which will be of particular interest to school board members,
legislators, lawyers and journalists. These questions are grouped
into two major categories: (1) test purposes and users; and (2)
current testing issues. Current testing issues include how teachers
view testing, why achievement test scores are declining, the meaning
of the truth in testing legislation, the meaning of test bias, issues
related to IQ testing, educational and legal issues surrounding
minimum competency testing, and the evaluation of teachers in the
schools. In addition, an annotated bibliography, a glossary of
measurement terms, and a summary of common test scores are included
that aid in the layperson's quest for information related to the
issues. (Author/EL)



* Reproductions supplied by EDRS are the best that can be made *
* from the original document.                                  *






Educational Testing Facts and Issues:
a layperson's guide to testing in the schools

Beverly L. Anderson
Richard J. Stiggins
David W. Gordon




National Institute of Education, U.S. Education Department 
Contract No. 400-79-0059 






Coordinated by: 

Nero and Associates, Inc.

520 S.W. Sixth Avenue, Suite 820 

Portland, OR 97204 

Susan Rath, Project Director 

Materials developed by: 

Northwest Regional Educational Laboratory 
Assessment and Measurement Program 
710 S.W. Second Avenue 
Portland, OR 97204 

California State Department of Education 
Office of Program Evaluation and Research 
721 Capitol Mall, 4th floor 
Sacramento, CA 95814



Acknowledgements:

Special thanks are due to the many workshop participants and sponsors
who provided helpful comments during the development of this
booklet. Appreciation is also due to the legislators, school board
members, journalists, measurement specialists, lawyers, and test
publisher representatives who reviewed it, and Carol DeWitte who was
responsible for its production.

Designed and illustrated by Warren Schlegel 

Edited by Jane Loftus 



September 1980 



This booklet is intended to be used in conjunction with workshops and
seminars conducted by measurement specialists using the training methods
described in Training Citizen Groups on Educational Testing Issues: A
Trainer's Manual, developed under this same contract.

These materials are in the public domain and may be reproduced without
permission. The following acknowledgement is requested on materials
which are reproduced: Developed by the Northwest Regional Educational
Laboratory, Portland, Oregon and the California Department of Education.

This booklet was prepared by the Northwest Regional Educational
Laboratory, a private nonprofit corporation, and the California Department
of Education under a subcontract with Nero and Associates, Inc.,
Portland, OR. The work contained herein has been developed under a
contract with the National Institute of Education, U.S. Education
Department pursuant to Contract No. 400-79-0059/SB0408(a)-79-C-1[illegible]7. The
opinions expressed in this publication do not necessarily reflect the
position of the National Institute of Education, and no official
endorsement by the Institute should be inferred. Mention of trade names,
commercial products, or organizations does not imply endorsement by the
U.S. Government.






Table of Contents

INTRODUCTION

OVERVIEW OF TEST PURPOSES AND USERS

    Who uses tests?
    What are the most common types of tests?
    What are the major purposes of testing?
    What are limitations of tests?
    Who is responsible for initiating testing?
    Who constructs tests?
    What are the costs of testing?

CURRENT TESTING ISSUES

    How do teachers view testing?
    Why are achievement test scores declining?
    What is the meaning of the truth in testing legislation?
    What is the meaning of test bias?
    What are the issues related to IQ testing?
    What are the educational and legal issues surrounding
        minimum competency testing?
    Are tests being used to evaluate teachers in schools?

ANNOTATED BIBLIOGRAPHY

APPENDIX A: A GLOSSARY OF MEASUREMENT TERMS

APPENDIX B: SUMMARY OF COMMON TEST SCORES



Introduction 



This booklet addresses the role of
testing in today's public education
system, and presents a series of
questions and answers which will be of
particular interest to school board
members, legislators, lawyers and
journalists. These questions are
grouped into two major categories:

• Test Purposes and Users

• Current Testing Issues

Before presenting these issues, a
short scenario from a typical school
may help in establishing a context for
the role of testing in schools today.

An interviewer recently visited a 
junior high school to learn more about 
the role of testing in the school. 
Walking down the hall, the first 
person the interviewer met was a 
student leaving a room marked with a 
sign "Testing - Do Not Disturb." 

The interviewer said, "Hi! I'm
visiting your school and want to find
out what kind of testing is done
here. It looks like you just took
some tests."

"Yes," the student replied.
"We're taking a series of tests this
week to find out what classes we
should be taking. They just gave me
some tests in math and reading."

The interviewer asked a teacher 
about the testing that was being 
done. "Yes, we use those results to 
group students. But if a teacher 
disagrees with the placement of a 
student, the teacher's opinion is 
taken into account as well as the test 
results." 

After several more stops, the 
interviewer found that in the history 
and social studies classes, no 
standardized achievement tests were 
given; rather, all the testing done in 
those classes was designed by the 
classroom teacher. 






At the district testing
specialist's office located at the
junior high, the interviewer discussed
the district testing program with the
specialist.

INTERVIEWER: What are the major
reasons for testing in your district?

SPECIALIST: The districtwide testing
is for three major purposes: first,
to determine trends in student 
performance over the years; second, 
for program evaluation; and third, to 
determine student placement. 
Diagnostic testing is done at the
discretion of teachers and
principals; it is not determined at a
district level.

INTERVIEWER: What types of tests are
used?

SPECIALIST: Let me give you an
example of what a typical student 
would experience in grades K through 
12. During their first two months in 
kindergarten, students are given a 
screening test. It is essentially an 
observation of a student's physical 
development, verbal and other academic 
skills. 

In grades 1 through 6, the student 
takes a standardized reading and math 
test each spring. In grades 7, 9 and
11, the student takes a language arts 
test as well as the reading and math 
test. In grades 3, 7, 9 and 11, an 
aptitude test is given along with the 
achievement battery. The purpose of 
the aptitude test is to establish 
expected levels of performance on the 
achievement test. 

INTERVIEWER: How many hours of
testing do you think the typical 
student experiences? 






SPECIALIST: Well, the districtwide
testing I mentioned takes about two
hours in the first grade with the
amount of time increasing progressively
to nearly six hours in the fifth
grade. From the fifth grade on, it
fluctuates between four and six hours.

INTERVIEWER: What about students who
are having difficulties in certain
areas or appear to be in need of
special education?

SPECIALIST: Now you have hit on an
important purpose for testing.
Students in special programs such as
Title I, Follow Through, or a
bilingual program experience much more
testing. Nearly all federal or state
funded programs require program
evaluation; typically, students are
tested both in fall and spring for
this purpose. We wish the testing
could be coordinated with districtwide
testing, but an evaluation
frequently requires a different test;
thus these students take at least two
more tests during the year.
Furthermore, programs like Title I
frequently require diagnostic testing
throughout the year. Students in such
programs may participate in double or
triple the amount of testing of the
typical student.

INTERVIEWER: I hear a lot about
minimum competency testing. Are you
doing such testing in your district?

SPECIALIST: Not yet, but we will be
starting next year. Our school board
feels that minimum competency testing
will be very useful in identifying
students who should receive remedial
instruction. They are still debating
whether or not to require passage of
the test for graduation. They have
decided to wait until after next
year's testing to decide. We have
spent a lot of time this year working
with teachers, administrators and
community members to decide what
competencies to test with the MCT, as
we call it. We contracted with an
educational service agency to prepare
the test once we had the competencies
and skills identified.

INTERVIEWER: Are people concerned
about cultural bias in testing?

SPECIALIST: Yes, there is much talk
about cultural bias. Unfortunately
there are so many different
interpretations of what cultural bias is that
we have a very difficult time dealing
with it. I'm going to a workshop next
month on the topic which will hopefully
help me determine how to handle this
issue. Partly out of concern about
cultural bias, we are seriously
considering eliminating our aptitude
testing, but I'm not ready to
recommend that yet.

INTERVIEWER: Another topic I am
hearing more and more about is teacher
evaluation and the use of testing for
that purpose. Is that an issue in
your district?

SPECIALIST: Do you mean the use of
student test scores in evaluating
teacher performance or actually
testing teacher competencies?

INTERVIEWER: I was thinking of the
former, but both topics are of interest.

SPECIALIST: Because of the many
problems inherent in using student
test scores for teacher evaluation, we
do not use them for that purpose. We
are getting pressure from parents,
however, to at least consider looking
at the scores of students over several
years when a particular teacher's
performance is questioned. As far as
testing teachers, we just started
giving teacher applicants a test of
basic skills competencies. Teachers
already in the district are not tested.

The district described in
this imaginary interview is meant to
be representative of many districts
across the country. The issues raised
here are discussed in the following
pages.






Overview of Test Purposes and Users 



Who uses tests? 

Tests are used by many people.
Teachers use tests to determine
students' progress in learning
specific skills. Parents use test
scores to tell them how their child is
doing in school or to see how their
school compares with other schools.
School board members and legislators
use test data to help set policy and
allocate funds. School principals,
guidance counselors, district
personnel and state department of
education staff also require
information on how well students are
learning. News reporters often
request student test scores for
reports on the quality of schools.
Lawyers may find test scores to be
important in certain legal cases.
State, federal, or private agencies
which fund special programs often
require student test scores to
evaluate the program's effectiveness.
And, of course, students use test
scores to determine if they are
learning what they are expected to
learn.

What are the most common types
of tests?*

There are several types of
measurement devices used in the
schools. Some tests measure knowledge
and skills and some measure other
characteristics. There are two main
types of cognitive measures used in
today's elementary and secondary
schools: achievement tests and
aptitude tests. Other measures such
as attitude inventories and interest
inventories are also used.

*See Appendices for a glossary of
measurement terms and descriptions of
test scores.



ACHIEVEMENT TESTS 

These tests measure how much a 
student has learned or what skills the 
student has acquired. Achievement 
tests are developed by teachers for 
classroom use or by test publishers 
for use by schools and school 
districts in large-scale testing 
programs. In either case, the test is 
developed by outlining the material to 
be tested and writing test items 
representative of that material. 
Achievement test scores are used by 
teachers and students to help plan and 
manage instruction (diagnose 
weaknesses, assign grades, etc.), to 
certify mastery of minimum essential 
skills, to select students for 
admission to college, to plan career 
directions, and to evaluate the
quality of educational programs.

Achievement tests come in two 
basic forms: those used to compare 
one student's learning with that of 
another student and those used to 
determine if a student has mastered 
particular knowledge and skills 
regardless of how other students 
score. Many achievement tests given 
are standardized. These tests cover 
material taught in most schools in 
subject matter areas such as reading, 
language arts, mathematics, science, 
and social studies. Once developed, 
the tests are administered to large 
national samples of several thousand 
students. Student performance is then 
analyzed and ranking by scores is 
established. These comparative or 
norm referenced tests are then used at 
the local district level where they 
allow the comparison of student test 
scores within the district. For
example, a student may be at the 40th
percentile compared to a national norm
group, but at the 50th percentile
compared to a local norm group. This
would indicate that the district as a
whole was performing lower than the
national group.
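
The percentile comparison above can be illustrated with a small Python sketch. It is not taken from the booklet: the two norm groups are invented numbers, the function name `percentile_rank` is hypothetical, and the definition used (percent of the norm group scoring strictly below the student) is only one of several conventions in use.

```python
from bisect import bisect_left


def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring strictly below `score`.

    Assumption: ties are not counted; other conventions count half of them.
    """
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)  # number of scores lower than `score`
    return 100.0 * below / len(ordered)


# Hypothetical norm groups (invented for illustration only).
national_norm = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
local_norm = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]  # scores run lower

student_score = 50
national_rank = percentile_rank(student_score, national_norm)
local_rank = percentile_rank(student_score, local_norm)

print(f"National percentile rank: {national_rank:.0f}")
print(f"Local percentile rank:    {local_rank:.0f}")
```

The same raw score ranks higher against the local group because the local scores are lower overall, which is the situation the example in the text describes.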

Norm referenced tests are used to
select students for remedial or
advanced programs. In addition, these
tests are used as a guidance tool for 
the long-term educational and 
vocational planning of the student. 

Achievement tests can also show 
the quantity of specific knowledge and 
skills (learning objectives) that the 
student has mastered. These tests, 
known as criterion or objective 
referenced tests, are most useful for 
diagnosing specific strengths and 
weaknesses in individual students, for 
certifying mastery of minimal 
competencies, and for evaluating 
specific educational programs. 
Objective referenced tests are most 
often developed by teachers. However, 
nearly all major test publishers have 
objective referenced tests available. 
In some cases, test publishers may 
provide both objective and norm 
referenced interpretations for the 
same test. Increasing numbers of 
local districts employ testing 
specialists to develop their own 
objective referenced diagnostic 
tests — either for districtwide testing 
or for local diagnostic use by 
teachers. Some states (California,
Michigan, Oregon, Texas and New
Jersey, among others) are also
developing objective referenced tests
for statewide assessment purposes.



APTITUDE TESTS 

Aptitude tests are designed to 
measure the ability to do school 
work. These tests can measure the 
ability to use language, to solve 
problems, to deal with mechanics and 
to think in terms of mathematics.
These abilities are not inherent or
unchanging. They can be influenced by
many factors: experience, family,
culture, emotions and health.
Aptitude relates to achievement in
that abilities provide a basis for
achieving. Aptitude influences the
amount of learning that takes place.
Aptitude test scores are commonly norm
referenced or comparative.

A summary of the various test 
scores commonly used for the different 
cognitive measures is presented in 
Appendix B. 



ATTITUDE INVENTORIES 

Another common test investigates 
how students feel toward school, or 
toward a particular subject or person 
within the educational system. Such 
inventories are frequently used in
evaluating special programs. Seldom
are they administered districtwide.
While such measures are available from
commercial publishers, these
inventories are usually developed
locally to answer questions of
interest to a particular district. 
They often have low or unknown 
validity and even when appropriately 
used must be interpreted cautiously 
and in conjunction with other data. 



INTEREST INVENTORIES 

These instruments attempt to 
pinpoint any interest that may 
influence a student's learning or 
career plans. Usually a guidance 
counselor or teacher has responsi- 
bility for interpreting the results. 

What are the major purposes of 
testing? 

Tests are used for three
purposes: instructional management,
entry-exit decisions and programmatic
decisions. Instructional management
and entry-exit decisions require test
data for each student. Programmatic
decisions can be made based on group
data, which allows a sampling of
students rather than testing every
student.
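
To show why group-level decisions can rest on a sample, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the simulated district of 2,000 scores, the sample size of 200, and the helper name `estimate_group_mean` do not come from the booklet.

```python
import random
import statistics


def estimate_group_mean(scores, sample_size, seed=0):
    """Estimate a group's average score from a simple random sample."""
    rng = random.Random(seed)                 # fixed seed so the run repeats
    sample = rng.sample(scores, sample_size)  # sampling without replacement
    return statistics.mean(sample)


# Simulated district: 2,000 invented scores centered near 50.
rng = random.Random(42)
district_scores = [rng.gauss(50, 10) for _ in range(2000)]

full_mean = statistics.mean(district_scores)                # every student
sample_mean = estimate_group_mean(district_scores, 200)     # 1 in 10

print(f"Mean from testing everyone: {full_mean:.1f}")
print(f"Mean from a sample of 200:  {sample_mean:.1f}")
```

The sample average lands close to the full average, which is enough for a group-level question. It says nothing about any individual student who was not tested, which is why instructional management and entry-exit decisions still need data on each student.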



INSTRUCTIONAL MANAGEMENT 

Tests play an important role in
instructional management decisions.
Data from these tests are used for the
diagnosis of students' strengths and
weaknesses, student placement, and
educational-vocational student
guidance.

Diagnosis. Perhaps the most
frequent use of tests is to diagnose
the educational development of
individual students. Here, the
teacher is the primary decision maker,
although students may also be
involved. Teachers often use tests
and other performance indicators to
assess the student's current
development so that the next, most
appropriate instructional unit is
selected. Tests useful in diagnostic
decision making are those that reveal
precisely what skills and knowledge
the student has or has not mastered.

Placement. If diagnosis
determines what instructional units
within a course a student needs to
master, then placement groups the
student according to the next level of
instruction best suited to that
student's skills. In this case, the
decisions are made by administrators,
teachers, and guidance counselors who
must place each student in the most
appropriate course. Math tests, for
example, might be used to place
students at the appropriate level in a
high school math course sequence. A
test which indicates student ability
in math will ensure that students will
not be assigned to courses which are
too advanced or too elementary for
them. Placement tests usually cover a
broader range of knowledge and skills
than diagnostic tests and are only
used once or twice a year. Diagnostic
tests may be used on a day-to-day
basis. However, completion of grades
and courses is also considered in
placement decisions.

Testing is the major method used
to identify students who would benefit
from placement in special programs
(bilingual programs, special education
programs, remedial reading and math) or
particular educational experiences.
Standardized achievement tests are the
most frequently used measures for
placing students in compensatory
education programs. In addition,
aptitude and psychomotor tests are
often used to identify students who
need special education.

Guidance. While diagnosis matches
the student to an instructional unit,
and placement matches a student to a
course, guidance can determine an
entire program of study. Here,
students and their parents, assisted by
guidance counselors, make the
decisions. When students decide which
educational and vocational program to
pursue, they must consider their
chances of success and satisfaction.
These career planning decisions,
typically made in junior and senior
high school, are assisted by the use of
tests that cover broad academic areas
and tell the students where they stand
in relation to other students. These
test scores can also determine
students' strengths and weaknesses,
which will aid them in making
choices. Test scores, of course,
should never serve as the sole basis
for any guidance decision. The
student's academic record, interests
and aspirations all merit
consideration.

Guidance testing, which is 
generally determined by school or 
district administrators and guidance 
counselors, is usually a secondary 
result of placement or diagnostic 
testing.



ENTRY OR EXIT DECISIONS

Tests may be used to determine
if a student should be placed in an
educational program or to determine if
a student has completed a program's
requirements. For example, tests may
be administered in order to select
students for programs with limited
enrollment (e.g., college entrance or
trade school), or to certify minimum
competencies (e.g., for high school
graduation or occupational licensing).

Selection. The difference between
selection and placement is not always 
clear. Placement, as previously 
described, groups students in the most 
appropriate level of instruction. 
This is an instructional management 
decision. Selection refers to a 
process whereby students are screened 
for admission to an educational 
program which has a limited number of 
participants. Admission is based on 
who is likely to benefit. Here, the 
key decision makers are teachers and 
administrators. A test used for the 
purpose of selection focuses on 
students' skills and knowledge 
considered essential for success in 
the program, and compares students' 
relevant skills and knowledge so that 
those most likely to succeed are 
identified. Admission to college or 
into a particular course (for example, 
airline pilot training) are prime 
examples of selection. However, test 
scores are not the sole basis for
selection decisions. Previous 
academic record and other performance 
criteria may also be considered. 

Perhaps the most common use of
selection testing is the college 
entrance examination. Colleges 
require a specific entrance examina- 
tion and interested students register 
with test publishers who carefully 
control the administration of the 
tests at various locations across the 
country. 

Certification. Tests often play
an important role in certifying
acceptable minimum levels of [...].
For example, a teacher [...]
verbal skills required for completion
of a certain course. Or, a district
administrator applying Board of
Education graduation standards might
use an examination in order to test a
student's mastery of minimally
acceptable skills. Or, members of a
certain technical profession might use
a test to certify competence in that
profession. Since, in each case,
those taking the exam must pass the
test to be certified, the test must
focus specifically on clearly stated
minimal competencies.



PROGRAMMATIC DECISIONS 

A third use of tests is to assist
in program planning. In this
instance, test data may be helpful in
providing the basis for developing a
new program, allocating funds or
evaluating existing programs. Such
testing falls into three categories:
survey assessment, formative program
evaluation and summative program
evaluation.

Survey Assessment. Probably the
most common use of testing in
education is to survey student
achievement and analyze trends over
time in order to assist in program
planning. This kind of testing is
usually designed to raise issues for
further investigation. For example,
the test results might prompt such
questions as, why are math scores
gradually declining in the district
(or state or nation)? Or, why are
reading scores of fourth graders
consistently below national averages
while those in other grades are above
average? The test data are used to
identify which aspects of the
educational system need to be more
thoroughly investigated as well as
possible reasons for unsatisfactory
performance. For this purpose,
achievement test scores, sometimes
from random samples of students, are
gathered annually, then averaged
across the entire school, district or
state, and used to indicate the level
of student development. In order to
[...] frequently compared from year to
year. This information then becomes a
basis for setting educational policy
and allocating funds. Typically,
educational administrators are the
primary decision makers, but they must
justify those decisions to the
ultimate decision maker, the
taxpayer. Tests used to assess an
educational program must cover broad
content and skill areas in order to
provide valid information for program
changes.

Formative Evaluation. In
formative evaluation, the goal is to
determine which instructional units or
features of a specific educational
program (e.g., remedial reading) are
effective and which need revision. In
this instance, tests are used to
measure what the students learn in a
specific program and the results are
used to help shape or revise the
program during its formative stages.

Summative Evaluation. Summative
evaluation reveals a program's overall
merit, and suggests whether or not a
program should be continued, 
terminated, or expanded. Tests 
designed to assess knowledge gained 
from a program are an important part 
of such an evaluation. Teachers, 
program, building or district 
administrators, and the public, 
represented by the board of education, 
may be involved in summative 
evaluation decisions. Tests may be 
given both before and after 
instruction, with retesting after an 
interval to determine the student's 
retention of knowledge. 
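
The before, after and delayed testing pattern just described comes down to simple arithmetic on group averages. The Python sketch below uses invented scores and a hypothetical helper, `gain_and_retention`; it shows one way the comparison might be laid out, not a procedure given in the booklet.

```python
import statistics


def gain_and_retention(pre_scores, post_scores, delayed_scores):
    """Compare average scores before, right after, and some time after a program.

    Returns (gain, retained_gain, fraction_retained). The fraction is None
    when there was no gain to retain.
    """
    pre = statistics.mean(pre_scores)
    post = statistics.mean(post_scores)
    delayed = statistics.mean(delayed_scores)
    gain = post - pre                # learning shown right after instruction
    retained_gain = delayed - pre    # what is still there at the retest
    fraction = retained_gain / gain if gain else None
    return gain, retained_gain, fraction


# Invented scores for three students, for illustration only.
pre_test = [40, 50, 60]
post_test = [70, 80, 90]
retest = [60, 70, 80]

gain, retained, fraction = gain_and_retention(pre_test, post_test, retest)
print(f"Average gain: {gain}, still retained at retest: {retained}")
```

A real evaluation would also need a comparison group and some allowance for score variability, so a difference like this is a starting point for judgment rather than proof of a program's merit.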

It should now be obvious that 
tests are used for many different 
purposes in education. Many decisions 
using test data affect individual 
students, while other decisions affect 
whole groups. The implications of
these vary. Some have [...]



What are the limitations of tests?

Test users should consider that
tests represent only one of many types
of performance indicators. In the
classroom, day-to-day classroom
activities and classwork represent
important and valuable sources of
information about student development
that should be used to supplement test
information in making educational
decisions. Tests are also supplemented
with professional teacher judgments.

Tests are designed for certain 
uses; a single test cannot serve all 
purposes. Tests are limited in terms 
of the range of decisions they can 
help with. Generally, a test is
capable of assisting in one or two of
the decisions previously discussed.
The key to using tests effectively is
to know what decision is to be made, 
to determine what material needs to be 
tested to aid that decision and to be 
certain that the test used actually 
covers that material. 

Tests are also limited in the 
material they cover. Generally, tests 
cover only a sample of the content or 
skills taught. It is almost never 
feasible, both in terms of time and 
money, to test every aspect of the 
subject matter taught. As a result of 
this sampling procedure, as well as 
uncontrollable factors such as 
motivation and fatigue, test scores 
are subject to some variability. That 
is, if the same test were taken twice
by the same student, the score might 
vary slightly due to the imprecision 
of the test. Therefore, a score 
should seldom be seen as completely 
precise or unchanging. Rather, it 
should be seen as a general 
performance index. 
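
One common way to put a number on this variability is the standard error of measurement from classical test theory, SEM = SD x sqrt(1 - reliability). The passage above does not name this formula, so the Python sketch below brings it in as an outside assumption, and the score, standard deviation and reliability values are invented for illustration.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed, score_sd, reliability, width=1.0):
    """Range of `width` SEMs on either side of an observed score."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return observed - width * sem, observed + width * sem


# Invented values: scores with SD 10 on a test with reliability 0.91.
sem = standard_error_of_measurement(10.0, 0.91)
low, high = score_band(50.0, 10.0, 0.91)

print(f"SEM is about {sem:.1f} points")
print(f"A score of 50 is better read as roughly {low:.0f} to {high:.0f}")
```

Reporting a band instead of a single number makes the point of the paragraph concrete: two scores whose bands overlap should not be treated as clearly different.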






Another limitation of tests is
that they are easy to misuse. They
are readily available and relatively
easy to construct, especially if
quality is disregarded. Misuse can
only be avoided by knowing precisely
how the test score is to be used and
by selecting or building a test
specifically designed to serve that
purpose.

Who is responsible for initiating 
testing? 

Often it is assumed that tests are 
initiated, for the most part, by 
teachers who need information to
improve instruction. This is 
generally true, however, mainly of 
teacher-made tests and curriculum- 
related tests. It is not the case 
with most standardized tests or 
district and state-developed tests. 
Decision makers at all levels — federal, 
state, and district—need information 
from these tests. 

At the federal level, the primary
impetus for testing comes from
federally-funded special programs,
which usually require the evaluation
obtained by using standardized
achievement testing. Title I of the
Elementary and Secondary Education
Act, which provides funding for
compensatory education, is a case in
point. As the largest single item in
the United States education budget,
Title I programs are subjected to
rigorous evaluation to demonstrate
effectiveness. Although current Title
I evaluation procedures require local
programs to use either standardized
tests or a combination of nonnormed
tests and a standardized test,
specific recommendations for which
particular tests to use are carefully
avoided.

At the state level, the most
common reasons for testing are
statewide assessment for
accountability, minimum competency
testing, and evaluation of
state-funded special programs.
Legislators, who wish
evidence that schools are doing the
job they're being funded to do, often
call for statewide assessment
testing. The late 1960s saw many
such assessment programs established.
Following the state assessment
movement was the public outcry for
students to achieve certain minimum
competencies before high school
graduation. In response, at least 38
states have enacted legislation
requiring minimum competency testing.
Evaluation of state-funded special
programs also provides an impetus for
state-level testing.

Generally, federal and state 
regulations allow state and local 
education agencies considerable 
latitude in setting their own testing 
procedures. For example, although 
Title I evaluation requires the use of 
standardized tests, many different 
standardized tests are available. 
Although states may put some 
limitations on which tests are 
acceptable, final selection is 
generally a local decision. 

Most district-initiated testing is 
done to ensure accountability, to 
place students in special programs, to 
evaluate program results, and to make
instructional management decisions. 
Typically, the district decides which 
test is to be used for evaluating 
federal- and state-funded special 
programs. District level testing 
policy beyond that required by federal 
and state regulations is determined by 
many factors: public pressure for 
accountability, teacher and 
administrator demands that tests be 
reflective of program goals and 
content, pressures from teachers'
associations to avoid using student 
test results in teacher evaluation, 
and requests from teachers and 
administrators to reduce the amount of 
testing. District administrators and 
school boards are frequently in a 
quandary when establishing a testing 
program that responds to these 
conflicting pressures. At the 
building level, the amount of 
additional testing beyond district 
requirements varies greatly. 
Generally, districts allow schools 
considerable autonomy, and the 
principal's perspective on testing can 
be a major influence. 

At the classroom level, teachers 
as individuals or teams often conduct 
additional testing at their 
discretion. Some teachers employ 
comprehensive diagnostic systems, 
particularly in the basic skill areas 
of reading and math. They also may 
administer unit tests which accompany 
textbooks. Teachers generally need 
more diagnostic test information on 
lower performing students than on 
others. 

In general, frequency of tests is 
determined by federal, state and 
district mandates for evaluation, 
accountability, student placement and 
certification rather than by requests 
from teachers or local administrators. 



Who constructs tests? 

Until recently, tests were almost 
exclusively constructed by either the 
classroom teacher or the commercial 
test publisher. But within the last 
15 years, state departments of 
education and local school districts 
have begun to develop their own tests. 

Classroom teachers generally 
construct tests to measure the 
specific instructional content being 
taught. These tests often take the 
form of a short weekly quiz, a 
mid-term examination or an end-of-the- 
course test. The test results are 
primarily used for grading or for 
helping students identify specific 
course content which they have not 
mastered. 

The most frequently used tests 
developed by commercial publishers are 
the standardized achievement and 
aptitude measures. These tests 
require careful development of 
questions as well as extensive admini- 
stration to establish interpretable 
test scores. During development, 
tests are administered to a carefully 
selected sample of students in a 
specified age or grade level. The 
results are used to establish scales 
which permit comparison of a student's 
score to national averages. The 
development of these "normative" 
scales is a costly process. 
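
As a rough illustration of how a normative scale lets a single score be compared with a reference group, the sketch below computes a percentile rank against a small norming sample. The function name and the sample scores are hypothetical, and real publishers use far larger, carefully selected samples and more refined scaling.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percent of the norm sample scoring below `score`, counting ties as half."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)           # scores strictly lower
    ties = bisect_right(ordered, score) - below   # scores exactly equal
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical raw scores from a norming sample (illustrative only)
norm_sample = [12, 15, 18, 20, 20, 22, 25, 27, 30, 33]

print(percentile_rank(22, norm_sample))  # 55.0
```

A percentile rank of 55 here means the student scored as well as or better than roughly 55 percent of the norming sample; it is not the percent of items answered correctly.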

Commercial publishers also develop 
criterion- or objective-referenced 
tests. These tests are not tied to 
any one textbook series, but are 
focused on particular knowledge or 
skills that can be taught by a variety 
of methods or materials. These tests, 
for example, may measure a student's 
ability to add whole numbers regard- 
less of the textbook or method of 
instruction used. 

Publishers also develop tests 
which are contained in or related to 
specific textbooks. These tests, 
which may be used at the end of a 
unit, are tied to information in a 
particular text or set of curriculum 
materials. 

The tests developed by state 
departments of education and local 
school districts are frequently 
designed to measure the school's 
success in teaching course content 
considered important in that state or 
district. Publishers' tests, based on 
the content most frequently taught 
across the nation, may not exactly 
match local curriculum content. Such 
tests should be carefully screened and 
selected to match local needs. 



What are the costs of testing? 

The actual cost of testing varies 
with the type of test used and its 
origin. For instance, objective tests 
scored by counting the number of test 
items answered correctly, and 
performance tests which require the 
observation and evaluation of a 
process or product by a qualified 
judge differ in cost. These tests may 
be purchased from a test developer or 
test publisher, or they may be 
developed by local educators for local 
use. The costs of testing depend on 
the combination of these factors. 

In all cases, there are three 
categories of costs: developmental 
costs, costs of test administration, 
and test scoring costs. 

When an objective test is 
purchased, developmental costs include 
(1) the cost of time required to plan 
the testing context which includes 
thinking through the decision to be 
made and the kind of test needed, (2) 
the cost of time to review available 
tests, and (3) the costs incurred in 
actually purchasing test booklets, 
answer sheets, administration manuals, 
etc. Test administration costs will 
include time to (1) plan test 
administration, (2) train test 
administrators, (3) coordinate 
distribution of materials, and (4) 
administer the test and collect 
materials. Test scoring costs include 
(1) the time required to count the 
items answered correctly or (2) costs 
of optical scanning and computer 
scoring of answer sheets. There are 
also costs involved in disseminating 
the scores and interpretative 
information to the decision maker in a 
timely manner. 

When an objective test is to be 
developed locally for local use, 
developmental costs include time 
required to (1) plan the test context, 
(2) write the test items, and (3) 
assemble the final test. If the test 
is to be used for very important large 
group decisions such as certifying 
proficiency for graduation, additional 
developmental costs will be incurred 
to pilot test the items before they 
are used in order to ensure a high 
quality test. Test administration and 
scoring costs will be the same as 
those previously discussed. 



When a performance-based test is 
to be used, the scoring becomes more 
expensive because qualified judges 
must be used to score the test. When 
such a test is to be purchased, 
developmental costs include (1) time 
to plan the test context, (2) time to 
locate, review and evaluate available 
test exercises and scoring (rating) 
procedures, and (3) the costs of 
purchasing test materials. Test 
administration costs will generally be 
the same as those involved in the 
objective test. Test scoring costs, 
when such tests are used on a large 
scale, include time required to (1) 
plan scoring procedures, (2) select 
judges, (3) train judges, (4) score 
the test, and (5) process scores for 
the decision makers. Individual 
classroom use of these tests requires 
only planning the scoring procedures, 
scoring the test, and preparing 
results. 

And finally, when a performance 
test is to be locally developed for 
local use, the test developer must (1) 
plan the test context, (2) develop 
exercises, (3) plan scoring standards 
and procedures and (4) conduct quality 
control research (for large-scale 
use). Test administration and scoring 
costs will be the same as those 
discussed above. 

The point is that there are real 
and significant costs associated with 
sound (fair and useful) testing. 
However, money spent for good 
assessment will pay dividends in the 
form of high quality educational 
decisions. 
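
The three cost categories named in this section (development, administration, scoring) can be tallied in a simple way. The sketch below is only an illustration: the class name, the dollar figures and the student count are all hypothetical assumptions, not estimates from any actual testing program.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    """Hypothetical cost estimate using the three categories in the text."""
    development: float     # planning, reviewing or writing, purchasing
    administration: float  # planning, training, distributing, administering
    scoring: float         # hand or machine scoring, or trained judges

    def total(self) -> float:
        return self.development + self.administration + self.scoring

    def per_student(self, n_students: int) -> float:
        return self.total() / n_students


# Illustrative figures only (assumed, not taken from the booklet)
purchased_objective = CostEstimate(development=2000, administration=1500, scoring=500)
local_performance = CostEstimate(development=6000, administration=1500, scoring=4500)

print(purchased_objective.per_student(1000))  # 4.0
print(local_performance.per_student(1000))    # 12.0
```

Laying the categories out this way makes it easier to see which component drives the difference between two testing options; in this made-up case it is development and judge-based scoring.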






Current Testing Issues 



In view of the variety of test 
purposes and users previously 
discussed, there are several important 
issues that need to be addressed. 



Issue 1: How do teachers view testing? 

Throughout the educational 
community there is growing concern 
about the role of testing in the 
schools. At all levels - federal, 
state, and local - educators are aware 
of the possibility of overtesting. 
Administrators are reviewing testing 
programs to ensure that the fewest 
number of tests are being used and 
that the purposes for testing are 
clearly defined. Teachers as well as 
other educators are opposed to tests 
which damage a student's self-concept, 
perpetuate negative expectations, are 
biased against economically disad- 
vantaged students or students with 
different cultural or linguistic 
backgrounds, or which are used as the 
basis for inappropriate comparisons of 
students or schools. Many educators 
are also opposed to the use of 
standardized tests for teacher 
evaluation and are particularly 
concerned that tests not be used as 
the sole criterion for important 
educational decisions. They are, 
however, supportive of testing to 
diagnose learning needs, prescribe 
instructional activities and measure 
progress in the curriculum content 
using tests prepared or selected by 
classroom teachers. Two major 
teachers' associations, the National 
Education Association and the American 
Federation of Teachers, have taken 
steps to investigate the issue of 
testing. For example, the National 
Education Association last year 
published two booklets, Parents & 
Testing and Teachers & Testing (see 
bibliography) to assist its members in 
understanding testing issues. The 
American Federation of Teachers is in 
the process of preparing a handbook to 
improve understanding and use of 
standardized tests in the classroom. 

Issue 2: Why are achievement test 
scores declining? 

Since the mid-1960s there has been 
a well-publicized decline in the 
achievement test scores of students in 
the United States. This decline has 
been found in nearly all subjects and 
all regions of the country, in 
almost all national testing programs, 
ranging from college entrance tests to 
elementary school achievement test 
batteries. Although precise amounts 
of score decline are difficult to 
determine, declines tend to be more 
pronounced through the higher grade 
levels and there seem to be 
differences in decline between male 
and female students. As we move into 
the 1980s, there is some evidence that 
the decline may have leveled out, but 
year to year test score patterns will 
have to be carefully observed in the 
future. 

During the mid and late 1970s, a 
great deal of educational research 
focused on reasons for the decline. 
Early studies dealt with explanations 
related to test characteristics, 
hypothesizing that the decline might 
be a technical, rather than a real, 
phenomenon. These hypotheses were not 
supported,^1 leading to the 
conclusion that the decline was a real 
and significant socio-educational 
fact. Subsequent efforts focused on 
social-educational reasons for the 
decline. 

^1 See Modu, C.C. and J. Stern. The 
stability of the SAT score scale. 
Research Bulletin RB-75-9, April 
1975. Educational Testing Service, 
Berkeley, CA. 

One example is the work done at 
CEMREL, a research institute in St. 
Louis. In this study (consult 
annotated bibliography for complete 
reference), researchers collected and 
summarized evidence on the test score 
decline and sought possible causes in 
the school environment. Information 
was gathered and interpreted on the 
potential role of such factors as 
curriculum, course enrollments, and 
amount of schooling, as well as 
television watching and family 
background and environment. The 
researchers concluded that there is no 
evidence of changing teacher qualifi- 
cations, and school organization and 
student motivation do not seem related 
to the decline. However, there is 
evidence of declining drop out rates 
accompanied by increasing absenteeism. 
This has the effect of leaving more 
low-achieving pupils in school. There 
is also evidence of a pronounced 
decline in the number of and 
enrollment in academic and college 
preparatory courses in high schools. 
In addition, some evidence was found 
that such non-school factors as TV 
watching, drug use, and family 
structure are potential contributors 
to the decline. From these initial 
exploratory efforts, the researchers 
concluded that there are many causes 
for the score decline and much added 
research is needed to provide a more 
concrete explanation for achievement 
drops. 

Two additional attempts to find 
explanations for the declining college 
admission test scores were conducted 
by the College Entrance Examination 
Board (CEEB) and The American College 
Testing Program (ACT). CEEB formed an 
advisory panel of noted scholars and 
educators to examine the decline in 
Scholastic Aptitude Test (SAT) 
scores. After a year of study, the 
committee concluded that the decline 
can probably best be explained in 
terms of changes in the population of 
students taking this particular test 
and changes in the socio-educational 
fabric of the United States. Since 
SAT and ACT tests are taken by a 
select group of students, the panel 
concluded that the current SAT-tested 
group is more broadly representative 
of American youth today than it was a 
decade ago when colleges were being 
more selective. Factors discovered to 
influence the socio-educational 
environment included increasing 
electives in high school, declining 
seriousness of educational purpose in 
society, television watching, changing 
family roles, the social unrest of the 
early 1970s, and motivation of 
students. 

ACT assembled evidence of 
declining ACT Assessment Program test 
scores and combined it with evidence 
from other national testing programs 
to conclude, as had CEEB, that the 
college bound student population is 
changing. With more middle and low 
achieving students now considering 
college and participating in college- 
entrance testing — because of available 
opportunities and financial aid — the 
result has been a lowering of 
the average test score. In this 
instance, the test score decline could 
be interpreted as evidence of 
increasing diversity in educational 
opportunity — a positive statement — 
rather than an indictment of the 
educational system. 
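
The arithmetic behind this population-change explanation can be shown with a small worked example. In the sketch below, the group sizes and group averages are invented for illustration; the point is only that the overall average can fall when more lower-scoring students take the test, even if no group's own average changes.

```python
def pooled_mean(groups):
    """Overall average for a list of (number_of_test_takers, group_average) pairs."""
    total_takers = sum(n for n, _ in groups)
    return sum(n * avg for n, avg in groups) / total_takers


# Hypothetical figures: each group's average stays the same in both years,
# but the lower-scoring group grows from 200 to 600 test takers.
earlier_year = [(800, 520), (200, 420)]
later_year = [(800, 520), (600, 420)]

print(pooled_mean(earlier_year))           # 500.0
print(round(pooled_mean(later_year), 1))   # 477.1
```

In this made-up case the overall average drops about 23 points although neither group performs any worse than before.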

The conclusion from these studies 
is that there is no single explanation 
for the decline in test scores. 
Rather, a large number of complex 
factors has caused the score patterns 
we now observe. However, even in the 
absence of a clear explanation for the 
decline, the publicity it has received 
has had a pronounced impact on 
schools. That impact has been felt in 
testing and instruction. Teachers 
have carefully scrutinized the tests 
used to show declining achievement and 
have challenged their appropriateness. 
And in response to the demand 
for alternatives, newly developed and 
specifically focused minimum 
competency tests covering relevant 
school and life skills have emerged. 
The effects on instruction have also 
been profound. Much more attention is 
being given to basic skills 
instruction in reading, writing and 
math from elementary school through 
college. 

Issue 3: What is the meaning of the 
"Truth in Testing" legislation? 

The debate over "truth in testing" 
resembles many of the arguments over 
consumer protection laws in the 
1960s. At the center of the debate 
are two definitions of "fairness." On 
one side are the proponents of 
disclosure legislation, who argue that 
as a matter of simple fairness 
students should be able to see the 
test instrument (including the 
questions, the answers and related 
test data) used to make important 
decisions about their lives. 
Proponents feel that tests are social 
policy instruments that should, in a 
democratic society, be open to 
scrutiny. The opponents of such 
legislation argue that test security 
ensures fairness, so disclosure of the 
tests will, by breaching security, 
affect the validity of the tests, 
increase the costs and lessen college 
admissions officers' confidence in 
standardized tests, all of which will 
make fair decision-making more 
difficult. They feel that secure 
standardized tests give everyone an 
equal chance and are more democratic 
instruments for policy making than are 
alternatives that permit the 
introduction of various biases. 

Proponents of the legislation 
believe that the principle of fairness 
outweighs technical objections to open 
testing. They contend that security 
is not essential for test validity and 
that the burden of proof rests upon 
the test companies. Specifically, 
they ask that the test companies prove 
their allegations that full disclosure 
will weaken test validity, increase 
development costs, exhaust the number 
of test questions that can be asked, 
erode confidence in tests and lead to 
unfairness in decisions that involve 
test scores. 

Opponents of the legislation, on 
the other hand, argue that the burden 
of proof rests upon the supporters of 
testing legislation. They ask for 
proof that a substantial problem with 
test use or abuse exists, that the 
legislation 
will correct any misuses and abuses, 
that the added complexity of test 
development required for open testing 
is necessary and that substantial 
benefits will accrue to individuals 
and society through test disclosure. 



CURRENT LEGISLATIVE ACTION 

The first law requiring test 
publishers to disclose information to 
test takers and the public was 
California's SB 2005, enacted in 
September 1978. The law applies to 
any standardized test used for 
postsecondary education admissions 
selection of more than 3,000 
students — in other words, such tests 
as the Scholastic Aptitude Test (SAT) 
and the American College Testing (ACT) 
Assessment. The law requires that a 
test's sponsor must file with the 
California Postsecondary Education 
Commission various kinds of data 
describing the test's features, 
limitations and use; must provide test 
takers with various kinds of 
information about the test and how it 
will be used; and must submit data 
about the administration of the test, 
the income realized and the expenses 
incurred in its administration. 

New York enacted a similar law in 
1979. Like the California law, it 
applies only to tests used for 
postsecondary or professional school 
admissions and requires test 
publishers to file background reports 
about their tests and provide test 
takers with test information. In 
addition, the New York law requires 
the test agencies to file the contents 
of the tests with the New York 
Commissioner of Education within 30 
days of release of scores, and, 
thereafter, to provide them to test 
takers upon request. 

In addition to these laws, similar 
bills, some requiring total disclosure 
of the test (as the New York law 
stipulates), have been filed in 
Florida, Maryland, Ohio, Texas, 
Colorado, Massachusetts, Pennsylvania 
and New Jersey, although none have, as 
yet, been enacted. Other state bills 
appear to be imminent. Two federal 
bills were introduced in 1979: the 
"Truth in Testing Act of 1979," known 
as the Gibbons Bill or H.R. 3564, and 
the "Educational Testing Act of 1979," 
known as the Weiss Bill, or H.R. 
4949. The former would cover achievement 
and occupational tests as well as 
admissions tests, but would not 
require total disclosure; the latter 
would be limited to admissions tests 
but would require total disclosure. 

All but two of the bills 
introduced apply to postsecondary 
education admissions testing only. 
They do not apply to standardized 
achievement tests used in public 
elementary and secondary schools, nor 
to personality, diagnostic, or minimal 
competency exams. An exception is the 
Massachusetts Bill which requires 
total disclosure of its competency 
tests. With the exception of the 
Gibbons Bill, these bills would not 
apply to occupational testing, civil 
service or licensing examinations. 
The New Jersey bill, however, would 
apply to all tests "developed by a 
test agency for the purpose of 
selection, placement, classification, 
graduation or any other bonafide 
reason concerning pupils in elementary 
and secondary, postsecondary or 
professional schools." 

The arguments surrounding test 
disclosure legislation are compounded 
by disagreements about the role and 
power of testing companies and the 
quality of standardized tests used 
primarily for predicting student 
performance. Table 1 summarizes those 
arguments which deal with the issue of 
test disclosure.^2 

Issue 4: What is the meaning of test 
bias? 

Perhaps the most difficult social, 
educational, technical, and legal 
issue facing educators in general and 
measurement specialists in particular, 
is the issue of test bias. Bias is 
such an important issue because it 
arises from our aspirations to achieve 
two highly valued goals. First, we 
have emerged from the 1970s with an 
ever growing awareness of the wide 
variety of cultures in our society and 
a desire to accommodate them. Second, 
we face the always present challenge 
of conducting good quality (fair and 
useful) assessment in our schools. 
These goals give rise to the need for 
testing methods that take into account 
cultural and linguistic differences in 
students. 

^2 The information in the table is 
taken from Searching for the Truth in 
"Truth in Testing" Legislation: A 
Background Report. Much of the above 
material has been abstracted from that 
report; those readers who wish to 
pursue the issues outlined are 
encouraged to obtain a copy of this 
publication. The report is available 
from ECS, 1860 Lincoln Street, Denver, 
Colorado 80295. The cost is $6.50 per 
copy. 



TABLE 1 

Debates For and Against Test Disclosure Legislation 



Pro-Legislation Sentiments 

Grade inflation, misuse have combined 
to give tests too much influence in 
admissions decisions. 

A commitment to "truth in lending," 
"truth in advertising," sunshine laws 
and consumerism should extend to an 
area as important as admissions 
testing. 

Legislation will promote greater 
accuracy, validity of tests. 

Legislation will encourage use of 
multiple criteria in selection process. 

The admissions test industry is not 
accountable to anyone. 

Students can learn about tests and 
test strategy from examining test 
questions. 

Security need not be an issue; new 
measurement technology could enable 
testers to eliminate the problem. 

Development costs would not increase 
as much as testers suggest. 

Items now available only to expensive 
coaching schools would be available to 
everyone, benefiting poor students. 

There are many solutions to the 
comparability problem; the laws do not 
adversely affect comparability 
measurement. 

The fairness issue takes precedence 
over technical matters. 

Disclosure will help admissions 
officers as well as students. 



Anti-Legislation Sentiments 

Higher education's need for students 
has lessened importance of admissions 
test scores. 

Test publishers and higher education 
institutions already provide ample 
information and protection; analogies 
to consumer movements are misleading. 

There are several competing public 
interests at stake; critics have not 
established an overriding need for 
legislation. 

Legislation calling for full 
disclosure will lower the quality of 
tests. 

Most institutions already use multiple 
criteria and test agencies encourage 
the practice. 

The industry is accountable to the 
psychometric profession, market 
forces, academic community. 

Federal legislation would constitute 
dangerous, if not unconstitutional, 
federal incursion into education. 

Legislation interferes with First 
Amendment right of colleges to 
determine who they want to teach. 



Meeting both priorities is a 
difficult challenge because we often 
lack the combination of cultural or 
linguistic knowledge and test 
development skills required to do the 
job. The equation is complex indeed. 
On one hand we have an examinee who 
brings to the test a language and set 
of cultural experiences that may 
represent any of hundreds of 
cultures. And, on the other hand, we 
have a test prepared by test makers 
(teachers or test publishers) who must 
make certain assumptions about 
language and cultural patterns in 
order to prepare test items. Claims 
are often made that tests are based on 
the language and culture of white, 
middle-class, suburban children and 
are inherently unfair to students who 
experience other cultural settings. 
Claims of ethnic, cultural, 
socio-economic and sex bias are 
widespread. 

Currently, test publishers and 
educational researchers are devoting 
considerable effort to clarifying the 
definitions of and reasons for test 
bias, and to determining how to deal 
with its existence. For instance, in 
1980 a National Symposium of Education- 
al Research sponsored by Johns Hopkins 
University was devoted to the topic of 
test item bias methodology. 



DEFINITIONS 

Although no single technically 
correct definition of test bias 
exists, one which repeatedly appears 
in the writings of researchers and 
publishers is that a test is biased if 
individuals from different groups who 
are equally able do not have equal 
probabilities of success. For 
example, on an achievement test, if 
students in one racial group score 
consistently lower than students from 
another group, and consistently lower 
than would be expected from their 
observed classroom performance, the 
test may be said to be biased against 
that group. Similarly, on a test used 
to select students for college 
admission, if students from one racial 
group score consistently lower than 
students from another group, but the 
performance of the two groups of 
students in the college program is 
comparable, the test may be said to be 
biased against the lower scoring group. 
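
The college-admission example above can be sketched as a simple comparison: is there a clear gap between groups on the test, while later performance is about the same? The function names, the tolerance values and all the scores below are hypothetical, and the check is a simplified illustration of the definition, not a method used by any particular publisher or researcher.

```python
from statistics import mean


def gap(values_a, values_b):
    """Difference between the two group averages (group A minus group B)."""
    return mean(values_a) - mean(values_b)


def may_be_biased(test_a, test_b, later_a, later_b, test_tol, later_tol):
    """True when group B scores clearly lower on the test but performs
    comparably on the later criterion (for example, college grades)."""
    lower_on_test = gap(test_a, test_b) > test_tol
    comparable_later = abs(gap(later_a, later_b)) <= later_tol
    return lower_on_test and comparable_later


# Hypothetical admission test scores and later grade averages
test_a, test_b = [52, 55, 58, 60], [44, 46, 50, 52]
grades_a, grades_b = [2.8, 3.0, 3.1, 3.3], [2.9, 3.0, 3.0, 3.2]

print(may_be_biased(test_a, test_b, grades_a, grades_b,
                    test_tol=5, later_tol=0.1))  # True
```

With these invented numbers the test gap is about 8 points while the grade averages differ by only a few hundredths, so the pattern matches the definition; a real study would need much larger samples and proper statistical tests.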

Several other definitions have 
been suggested. For example, one 
definition is that a test is biased if 
the different groups tested do not 
achieve the same average score on each 
item of the test. Another definition 
holds that a test is biased if two 
groups do not achieve similar total 
test scores. This definition allows 
for differences in performance on 
different items. These definitions 
assume that the groups are alike in 
knowledge of skills measured and any 
differences in performance are due to 
unfair items. These definitions have 
given rise to many public complaints 
of unfairness. However, it is 
critical to keep in mind that given 
our history of discriminatory educa- 
tional practices, differences in 
performance may be caused by factors 
other than biased test items. 

Another definition does not 
require that groups have the same 
ability or skill, but does require 
that differences hold true for all 
test items. That is, if differences 
are not uniform, it is assumed that 
the test items are measuring different 
things in the various groups. 

Other kinds of bias are not 
inherent in the test but, rather, 
relate to how a test is used. For 
example, bias could be shown to occur 
if a test were used to make a 
selection decision simply because the 
test is correlated with a third 
variable that is relevant to and 
predictive of job performance even 
though the test itself has not been 
established as relevant to job 
performance. The use of a test could 
be biased if it assessed only one 
prerequisite skill and ignored equally 
predictive and important skills for 
which the pattern of group performance 
was noticeably different. 




APPROACHES TO REDUCING TEST BIAS 

It is important to point out that 
there is no clear-cut "solution" to 
the problem of test bias. No 
"culture-free" test has yet been 
devised, nor is the state of the art 
such that one can be developed. The 
best that can be done is for test- 
makers to make vigorous efforts to 
continuously screen tests for potential 
bias, and for test users to be sure 
that test results are used fairly in 
all cases. 

One approach commonly used to 
avoid test bias is to have a panel of 
persons broadly representative of the 
various racial, ethnic and sexual 
groups that might be taking the test 
review the test questions. This helps 
ensure that test questions will not be 
biased or that they will not reflect 
only experiences or the culture of a 
particular group. This procedure 
should be undertaken not only when a 
test is first written, but periodi- 
cally thereafter so that changes in 
our culture do not make some questions 
obsolete for some groups. 

Another approach is to carefully 
examine the performance of various 
groups on the test as a whole as well 
as for individual questions. In this 
way, unusual variations in performance 
among the groups can be pinpointed, 
and the test questions reexamined in 
an effort to detect any characteris- 
tics or wording that would seem to 
make them biased towards a particular 
group. For publishers to conduct 
these studies, school districts must 
be willing to provide the demographic 
data necessary to perform the analyses. 
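
A simplified version of this question-by-question review might look like the sketch below. It compares the proportion of each group answering each question correctly and flags questions where the gap departs sharply from the typical gap across the whole test. The function name, the threshold and the proportions are assumptions made for illustration; actual item bias studies use more rigorous statistical procedures.

```python
def flag_items(p_group_a, p_group_b, threshold=0.15):
    """Return 1-based question numbers whose between-group gap in proportion
    correct differs from the average gap by more than `threshold`."""
    gaps = [a - b for a, b in zip(p_group_a, p_group_b)]
    typical_gap = sum(gaps) / len(gaps)
    return [number for number, g in enumerate(gaps, start=1)
            if abs(g - typical_gap) > threshold]


# Hypothetical proportions answering each of five questions correctly
group_a = [0.80, 0.70, 0.65, 0.90, 0.60]
group_b = [0.75, 0.66, 0.30, 0.85, 0.55]

print(flag_items(group_a, group_b))  # [3]
```

A flagged question is only a candidate for review: the wording and content still have to be reexamined by people who know the groups involved before anyone concludes that it is biased.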

Given the large number of languages 
and cultures in some educational 
environments, this process of careful 
test review and development will 
require significant time, money and 
patience. 






Issue 5: What are the legal issues 
related to IQ testing? 

People have disagreed, and will probably 
continue to disagree, about whether or 
how "intelligence" can be accurately 
and systematically measured. Some 
argue that evidence of intelligence 
can be reduced to a set of tasks which 
can be systematically measured through 
some form of performance or paper and 
pencil test. Others argue that traits 
such as common sense, wit, creativity, 
resourcefulness, ambition, and sensi- 
tivity are all important dimensions of 
intelligence and can never be adequate- 
ly quantified in a test score. 

IQ tests have historically been 
used to attempt to assess a child's 
aptitude for performance in school. 
These tests are designed to assess 
skills that are perceived to be 
prerequisites to learning skills such 
as verbal reasoning, spatial percep- 
tion, etc. Thus, high scores on the 
tests are often used to place children 
in classes for the gifted. Conversely, 
low scores are often used to place 
children in special education classes 
for the mentally retarded. The most 
commonly used individually adminis- 
tered IQ tests, the Stanford-Binet and 
Wechsler Intelligence Scale for 
Children (WISC), are forms of 
"performance tests." Children are 
given a set of tasks to perform and 
are judged on the speed and accuracy 
with which they perform them. One 
important assumption behind the tests 
is that "intelligence" is distributed 
in society along a normal curve. This 
means that a small number of people in 
the society will be very bright or 
very dull, and the majority will 
cluster around a point defined as 
average intelligence. 
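
The normal-curve assumption can be made concrete with a short calculation. The sketch below assumes a scale with an average of 100 and a standard deviation of 15, a common convention that is not stated in this booklet, and uses Python's standard library to estimate what share of scores would fall in different ranges.

```python
from statistics import NormalDist

# Assumed scale: average 100, standard deviation 15 (a common convention,
# not a figure taken from this booklet).
iq_scale = NormalDist(mu=100, sigma=15)

share_near_average = iq_scale.cdf(115) - iq_scale.cdf(85)  # within one SD
share_above_130 = 1 - iq_scale.cdf(130)                    # two SDs above
share_below_70 = iq_scale.cdf(70)                          # two SDs below

print(round(share_near_average, 3))  # 0.683
print(round(share_above_130, 3))     # 0.023
print(round(share_below_70, 3))      # 0.023
```

Under this assumption about two-thirds of scores cluster within 15 points of the average, and only about 2 percent fall in each extreme tail, which is the pattern described in the paragraph above.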

Since the way in which IQ test 
scores are used has significant 
consequences for children (e.g., 
placement in classes for the 
retarded), legal challenges have 
focused both on the nature of the 
tests and the ways in which the 
results are used. The most 
significant legal precedents in IQ 
testing come from a 1979 Federal 
District Court decision in a 
California case (Larry P. v. Riles, 
No. C71-2270 RFP, N.D. Cal. Decision 
10/16/79) and a 1980 Federal District 
Court decision in an Illinois case 
(Parents in Action on Special 
Education v. Hannon, No. 74C3586, N.D. 
Ill. Decision 7/7/80). 

The Larry P. v. Riles decision 
held that California school officials 
unlawfully discriminated against black 
children by using racially and 
culturally biased tests to classify 
and place them in classes for the 
educable mentally retarded (EMR) . 
Judge Robert F. Peckham provides the 
following summary of his 131-page 
opinion: 

"This court finds in favor of 
plaintiffs, the class of black 
children who have been or in the 
future will be wrongly placed or 
maintained in special classes for the 
educable mentally retarded, on 
plaintiffs' statutory and state and 
federal constitutional claims. In 
violation of Title VI of the Civil 
Rights Act of 1964, the Rehabilitation 
Act of 1973, and the Education for All 
Handicapped Children Act of 1975, 
defendants have utilized standardized 
intelligence tests that are racially 
and culturally biased, have a 
discriminatory impact against black 
children, and have not been validated 
for the purpose of essentially 
permanent placements of black children 
into educationally dead-end, isolated, 
and stigmatizing classes for the 
so-called educable mentally retarded. 
Further, these federal laws have been 
violated by defendants' general use of 
placement mechanisms that, taken 
together, have not been validated and 
result in a large over-representation 
of black children in the special 
E.M.R. classes. 



"Defendants' conduct additionally 
has violated both state and federal 
constitutional guarantees of the equal
protection of the laws. The 
unjustified toleration of 
disproportionate enrollments of black 
children in E.M.R. classes, and the
use of placement mechanisms, 
particularly the I.Q. tests, that 
perpetuate those disproportions, 
provide a sufficient basis for the 
relief under the California 
Constitution. And under the federal 
Constitution, especially as 
interpreted by the Ninth Circuit Court 
of Appeals, it appears that the same 
result is dictated. 

"Moreover, there is another basis 
for the federal constitutional
ruling. Defendants' conduct, in
connection with the history of I.Q.
testing and special education in
California, reveals an unlawful
segregation intent. This intent was
not necessarily to hurt black
children, but it was manifested, inter
alia, in the use of unvalidated and
racially and culturally biased
placement criteria. This intent,
consistent only with an impermissible
and unsupportable assumption of higher
incidence of mental retardation among 
blacks, cannot be allowed in the face 
of the constitutional prohibition of 
racial discrimination." 

Relief granted to plaintiffs 
included an injunction against 
defendants' use of standardized
intelligence tests for EMR
identification or placement without
court approval and an order that
defendants monitor and eliminate
disproportionate EMR placement of
black children. The court also
ordered the reevaluation, without the
use of such tests, of all black
children who had been placed in EMR
classes, as well as supplemental
education for all children found to
have been misclassified.

The trigger for the Larry P. v.
Riles court's legal scrutiny of IQ
tests and test bias was the
disproportionate number of black
children placed in EMR classes as a
result of IQ tests and the serious
injury of EMR placement to
misclassified children. The court
found that the EMR classes were 
"conceived of as 'dead-end classes'" 
for children incapable of learning the 
regular curriculum. Children in these 
classes tended to fall further and 
further behind children in regular 
classes since they were provided with 
instruction that deemphasized academic 
skills in favor of adjustment. 
Disproportionate numbers of black 
children had been placed in 
California's EMR classes. For 
example, the evidence showed that in 
the 20 districts accounting for 80 
percent of the enrollment of black 
children in 1976-77, black students 
comprised about 27.5 percent of the 
student population and 62 percent of 
the EMR population. This dispro- 
portion cannot be explained by chance 
since "there is less than one in a 
million chance that the overenrollment 
of black children and the underen-
rollment of nonblack children in the
EMR classes in 1976-77 would have
resulted under a color-blind system of 
placement." 
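
The size of this disproportion can be
made concrete with a little
arithmetic. The sketch below uses the
two percentages reported above; the
EMR enrollment of 1,000 is a
hypothetical figure chosen only for
illustration, since the count is not
given here, and the normal
approximation is a textbook shortcut
rather than a reconstruction of the
analysis presented to the court.

```python
import math

share_of_students = 0.275  # black share of student population (from the text)
share_of_emr = 0.62        # black share of EMR enrollment (from the text)

# Hypothetical EMR enrollment; the actual count is not reported here.
emr_enrollment = 1000


def representation_ratio(share_in_program, share_overall):
    """How many times over its population share a group is represented."""
    return share_in_program / share_overall


def z_score(observed_share, expected_share, n):
    """Normal-approximation z-score for a binomial proportion."""
    standard_error = math.sqrt(expected_share * (1 - expected_share) / n)
    return (observed_share - expected_share) / standard_error


def upper_tail_probability(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


ratio = representation_ratio(share_of_emr, share_of_students)
z = z_score(share_of_emr, share_of_students, emr_enrollment)
p = upper_tail_probability(z)

print(f"representation ratio: {ratio:.2f}")
print(f"z-score: {z:.1f}")
print(f"chance under color-blind placement below one in a million: {p < 1e-6}")
```

With these inputs, black children
appear in EMR classes at more than
twice their share of the student
population, and the chance of such a
gap arising at random is far below one
in a million.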

Although California law required 
IQ test scores to be "substantiated 
by" other evidence such as adaptive 
behavior (the ability to engage in 
social activities and perform everyday 
tasks), the court found that the
"magic of numbers" was strong and that 
the available data suggested very 
strongly that the IQ scores were a 
pervasive influence in the placement 
process. The entire placement process 
often revolved around the demonstra- 
tion of IQ. 

In an introductory discussion of 
intelligence tests subtitled "The 
Impossibility of Measuring 
Intelligence," Judge Peckham noted 
that the expert testimony overwhelm-
ingly rejected the concept that IQ was
an objective measure of innate, fixed
intelligence. 

"Defendants' expert witnesses, 
even those closely affiliated with the 
companies that devise and distribute 
the standardized intelligence tests, 
agreed, with one exception, that we 
cannot truly define, much less 
measure, intelligence — I.Q. tests,
like other ability tests, essentially 
measure achievement in skills covered 
by the examinations. The fact that 
IQ tests are developed according to 
the plausible but unproven assumption 
that intelligence is distributed in 
the population in accordance with a 
normal statistical curve — cautions us 
to look very carefully at what the 
tests do measure and exactly how they 
were validated for determining mental 
retardation." 

Noting that the disparities in EMR 
placement of black children are also 
reflected historically in black 
performance in general on standardized 
intelligence tests, Judge Peckham
examined three arguments used to
explain the disparity in IQ scores — 
the genetic argument, the socio- 
economic argument, and cultural bias. 
Judge Peckham rejected the genetic 
argument because defendants were 
unwilling to admit any reliance on it 
for policy-making purposes and because 
the rather weak evidence in support of 
this explanation tends to rest on the 
disparities in the IQ scores, which 
overlooks possible bias in the tests 
themselves. Judge Peckham also 
rejected the socio-economic argument. 
Testimony and studies showed that the 
relatively low scores of black 
children do not result from mental
disease attributable to the physical 
conditions of poverty. School 
performance, however, does vary 
somewhat according to socio-economic 
status. 

On the other hand, Judge Peckham
found the plaintiffs' evidence of
racial and cultural bias in the IQ
tests more persuasive. "The first
important inferential evidence is that 
the tests were never designed to 
eliminate cultural biases against 
black children; it was assumed, in 
effect, that black children were less 
'intelligent' than whites." He later
noted: "The tests had been adjusted, 
for example, to eliminate differences 
in the average scores between the 
sexes, but a comparable effort was not 
made and has never been made for black 
and white children." 

The court also found that 
Wechsler's admission in 1944 (that the
WISC's standardization was based upon 
white subjects only and that those 
norms cannot be used for the nonwhite 
population of the United States) 
applies with equal force to other 
standardized tests. These problems 
were not solved by the restandard- 
ization of the Stanford-Binet and 
WISC-R intelligence tests. The court 
went on to review a number of indica-
tors that point to the existence of a
cultural bias against black children:
vocabulary and other linguistic
differences, obviously biased items,
and more subtle kinds of bias involved
in measuring knowledge of white
culture. With only one exception,
there was general agreement by all
sides on the inevitable effect of
cultural differences on IQ scores.
Put succinctly by Professor Asa
Hilliard, black people have a "cultural
heritage that represents an experience 
pool which is never used" or tested by 
the standardized IQ tests. 

In analyzing the requirements of 
federal statutory law, the Larry P. v.
Riles case set legal standards for
validation of IQ tests used for EMR 
placement. Reviewing Title VI of the 
Civil Rights Act of 1964, the Rehabil- 
itation Act of 1973, and the Education 
for All Handicapped Children Act of 
1975 (EHA), and related case law,
Judge Peckham concluded that the
approach used in Title VII employment
test cases was generally appropriate
for allocating burden of proof for
"validation" in the Larry P. v. Riles
case. Under this procedure, tests 
shown to have a discriminatory impact 
cannot be utilized unless the employer 
is able to show that any given 
requirement has a manifest relation- 
ship to the employment in question. 
Judge Peckham noted, however, that the 
notion of predicting "job performance" 
cannot be effectively translated into 
an educational context given the 
differing purposes of employers and 
schools:

"Compulsory attendance of 
educational institutions is required 
by the state, and the schools are 
supposed to take children from 
different backgrounds and teach them 
the skills necessary for adaptation 
and success in our society. This
points out a fundamental difference
between the use of tests in employment
and education, at least in the early
years of schooling. If tests can
predict that a person is going to be a 
poor employee, the employer can 
legitimately deny that person a job, 
but if tests suggest that a young 
child is probably going to be a poor 
student, the school cannot on that 
basis alone deny that child the 
opportunity to improve and develop the 
academic skills necessary to success 
in our society. Assignment to E.M.R. 
classes denies that opportunity 
through relegation to a markedly 
inferior, essentially dead-end, track." 

Given this important distinction 
and federal regulations under EHA and 
the Rehabilitation Act requiring that 
tests and other evaluation materials 
be "validated for the specific purpose 
for which they are used," Judge 
Peckham replaced the predictive 
validity required in employment cases 
with an alternative kind of validation: 

"We are not concerned now with 
predictions of performance, but rather 
whether the tests are validated with 
respect to the characteristics 
consistent with E.M.R. status and
placement in E.M.R. classes. E.M.R.
classes exist 'for people whose mental
capabilities make it impossible for
them to profit from the regular
educational program.' 'Mental
retardation' is the touchstone, and
retardation must make it 'impossible'
to profit from the regular classes,
even with remedial instruction.
Defendants have the burden of showing
validation of intelligence tests with
respect to these characteristics."

In Parents in Action on Special 
Education v. Hannon, the presiding
judge, Judge Grady, focused sharply on
whether the IQ tests in question
(WISC, WISC-R, and Stanford-Binet)
are, in themselves, racially biased, 
and whether use of the tests as a part 
of the statute-mandated criteria for 
placement in classes of the "educable 
mentally handicapped" is racially 
discriminatory. In summary, the 
opinion concluded that: 

(1) Only one item on the 
Stanford-Binet and a total of eight 
items on the WISC and WISC-R are 
culturally biased against black 
children, or at least sufficiently 
suspect that their use is 
inappropriate. These few items do not 
render the tests unfair and would not 
significantly affect the score of an 
individual taking the test. 

(2) When used in conjunction with 
other statute-mandated criteria for 
determining an appropriate educational
program for a child, these tests do 
not discriminate against black 
children in the Chicago schools. 

In contrast to the Larry P. v. 
Riles decision, Judge Grady never
reached the question of appropriate
legal standards for evaluating
compliance with federal law. Instead,
Grady presented an exhaustive,
item-by-item analysis of questions
included in the three tests, found an
insignificant number to be biased,
and refused
to enjoin Chicago's use of the tests 
as a part of the placement process. 

The opinions in each of these 
cases are readable and informative.
Readers interested in more detail and 
background on the opinions are 
encouraged to obtain and review copies 
of the opinions from the respective 
District Courts. 

It is difficult to predict what 
will follow in the wake of these two 
opinions. While Judge Peckham in 
Larry P. v. Riles accepted the 
contention that IQ tests were biased, 
Judge Grady in Parents in Action v. 
Hannon rejected this allegation. 
Undoubtedly, further litigation will 
follow. The California Department of 
Education has already announced plans 
to appeal Larry P. v. Riles.

It is likely that the legal 
controversy over use of traditional IQ 
tests will spur research efforts to 
develop so-called "non-discriminatory" 
assessment batteries whose results 
will more accurately reflect the 
potential of minority children. One 
example of such a battery is the 
"System of Multicultural Pluralistic 
Assessment," known as SOMPA. SOMPA 
was developed by a sociologist at the 
University of California, Riverside, 
and is designed to provide a far 
broader picture of a child's potential 
based on a careful examination of the 
child's social and cultural background 
and experiences. It is unlikely that 
"alternative" IQ measures which are 
acceptable to critics of IQ tests will 
be developed and validated quickly. 

Issue 6: What are the educational 
and legal issues surrounding minimum 
competency testing? 

The fundamental purpose behind 
minimum competency testing is to 
determine whether students have 
acquired sufficient proficiency in 
certain basic and/or life skills to 
cope with the adult world. Two types
of tests exist: tests that measure the
basic academic skills of reading,
writing and computation, and tests
measuring "life skills" on topics such
as consumer awareness, health,
citizenship, balancing a checkbook or
applying for a bank loan.

In some states, the same test is 
given statewide, whereas in other
states each district designs and 
administers its own test based on 
locally determined competence. 

A 1979 study sponsored by the 
National Institute of Education
investigated 31 state and 20 local 
district competency testing programs 
in the United States. An executive 
summary of that study states:

"Sixteen of the 31 state-level 
programs were mandated by the State 
Board of Education, and 15 were
initiated by the state legislature. 
Two of the legislated mandates call 
for temporary programs; one State
Board-initiated program and one 
legislated program permit voluntary 
participation of local school 
districts. Two other states emphasize 
the competency-based instructional 
aspects of their programs rather than 
the testing components. 

"Of the 20 local programs studied, 
five developed in states without 
statewide requirements for minimum 
competency testing. Of the remaining 
15 districts, eight began instituting 
minimum competency testing programs
prior to state mandates, while seven 
districts implemented programs in 
response to such mandates. 

"The majority of programs, both 
state and local, were developed in the 
two to three years since 1976, but the 
age of programs ranged from 18 years 
to less than one year with ongoing 
pilot-testing. Fourteen state
programs have been fully implemented, 
while 17 are being phased in. For 
example, many state programs are 
introducing new graduation 
requirements or curriculum changes 
over a period of years and hence, 
these programs will not be "in place" 
until some time in the future. By 
comparison, 13 of the 20 local 
programs have already been fully
implemented, while seven programs are
phasing in mandated changes. 

"Programs in only four states have 
had litigation associated with them in 
any way — Delaware, Florida, Maryland, 
and North Carolina — and the majority 
of this activity has occurred in 
Florida. 

"With respect to goals and 
purposes, 14 states cited certifi- 
cation of basic skills competency 
prior to high school graduation as a 
major purpose, and two states reported 
using competency achievement as one 
criterion for grade-to-grade promotion
as a reason for implementing a minimum 
competency testing program. The most 
frequently cited purpose for
instituting such a program was to 
identify students in need of 
remediation; 19 states reported this 
purpose. Curriculum improvement was 
mentioned by 10 states as a major 
program goal. By comparison, 16 local 
districts reported certification of 
basic skills as one reason for 
developing a minimum competency 
testing program; four districts cited 
the use of test results, along with 
other information, to determine 
grade-to-grade promotion as a major 
purpose of the program. Eleven 
programs reported purposes related to 
providing remediation and seven 
districts mention curriculum change as 
a major purpose behind program 
implementation.

"Reading and mathematics were 
competency areas assessed in all state 
and local programs. Twenty-seven of
the state programs assessed skills in
language arts and/or writing, while 15
local districts assess these same 
skills. Skills in other subject 
areas, such as speaking, listening, 
consumer economics, science, 
government, and history, are assessed 
in only a few programs. Almost all of 
the tests administered in both state 
and local programs consist primarily 
of multiple-choice items, and a 
writing sample is the most frequently
selected non-multiple-choice
assessment."3



LEGAL ISSUES 

Many of the legal issues involved
in competency testing are inextricably
linked to issues of test quality and
the quality of educational programs
designed to support competency
testing. For example, the nature and
quality of a competency test may
trigger legal challenge, but test
quality is and should be in itself an
educational issue. Similarly,
insuring quality and effectiveness in
basic and remedial instructional
programs is one of the central
missions of education. Nevertheless,
in examining minimum competency
programs, courts are likely to closely
examine these instructional
activities. While it seems impossible
to clearly disentangle "legal" from
"educational" issues in minimum
competency testing, it is useful to
review the issues courts have examined
to date.

The distinction between using a
competency test only as a diagnostic
tool to identify student weaknesses in
basic skills and tying high school
graduation to successful performance
on the test is crucial in examining
the legal implications of minimum
competency testing. The legality of a
testing program will usually depend
more on how the test results are used
than on the nature of the test
itself. For example, as McClung
points out in a legal review of
competency testing: 



3 Gorth, W.P., and Perkins, M.R., A
Study of Minimum Competency Testing
Programs: Final Summary and Analysis
Report. Amherst, MA: National
Evaluation Systems, Inc., December
1979.

"Using the test results as the 
primary basis for any decision that 
will cause serious harm to a student 
raises the initial legal questions.
The trigger for legal analysis is this
injury. Assuming there is injury, the
following questions arise: Who is 
responsible for that injury and does 
that person or agency have sufficient 
justification for causing that injury? 

"If there is no injury, then there 
is no legal problem. Competency tests
can be used in many ways that cause no
injury to a student. For example,
competency tests could be used simply
to determine the general level of
student performance in basic skills on
a statewide or district level; to
identify basic skill areas in an
instruction program that need more
emphasis; or to diagnose areas in
which an individual student needs
specific help. In such cases, there
is usually no injury and no legal
problem.

"On the other hand, competency 
tests can be used to make decisions 
about individual students that have 
potential for grave injury. For
example, competency tests can be used
for tracking, grade promotion, or
denial of a regular high school
diploma. Diploma denial, as mandated
in Florida and California, probably
causes the greatest injury to an
individual student, and therefore
raises the most serious legal
questions (p. 657-658)."4

Minimum competency testing 
requirements that incorporate some
sanction upon students for failing to
pass the tests run the greatest risk
of legal challenge. These legal
challenges are most likely to be 
raised if competency testing programs 
touch on any of the following issues: 

4 McClung, M.S., "Competency testing
programs: Legal and educational
issues," Fordham Law Review, 47,
1979, 651-711.

• Potential for racial and 
linguistic discrimination 

• Adequacy of advance notice and 
phase-in periods prior to the 
initial use of the test as a 
graduation requirement 

• Psychometric validity or 
reliability of the tests 

• Match between the instructional 
program and the test 

• The degree to which remedial 
instruction may create or 
reinforce tracking 

1. Potential for racial and
linguistic discrimination. Briefly
stated, some states and many local 
school districts in the past have been 
found to have discriminated against 
racial and linguistic minority 
students in violation of the equal 
protection clause of the U.S. 
Constitution and Title VI of the Civil 
Rights Act of 1964. Examples of such 
states and districts include those
that have been held by courts to have
operated "dual school systems" for
blacks and whites and that have been
ordered to desegregate, and those that
have been found not to be providing
adequate bilingual instruction in
accord with the U.S. Supreme Court's
ruling in Lau v. Nichols. In states or
districts which have been subject to 
or are vulnerable to such findings, 
the effect of minimum competency 
testing requirements may be to 
reinforce the effects of prior 
discrimination. That is, the minimum 
competency testing sanction could pile 
one injury (diploma denial) on top of 
another (prior denial of equal 
educational opportunity).

2. Adequacy of advance notice and 
phase-in periods prior to the initial 
use of the test as a graduation 
requirement. Legal concerns for
fairness and due process will require
extensive notice of minimum competency
testing requirements to students and
parents. For example, the first class 
of students subject to a minimum 
competency testing requirement might 
not know that passing a competency 
test will be a condition for acquiring 
a diploma. The school district, in 
fact, would have explicitly approved 
students' progress by promoting them 
each year even though many of them 
lacked basic skill proficiencies. It 
is also likely that many, if not most, 
of those students failing the test 
might have studied differently and 
teachers taught differently had they 
received advance notice of the 
requirement. 

Procedures for notifying students 
vary from school to school. In most 
districts students are first given 
general notice of the proficiency 
requirement for a diploma and then at 
a later date notified of the specific 
performance objectives to be measured 
by the proficiency test. Students, 
parents and teachers should be given 
notice of both performance objectives 
and assessment procedures as soon 
after their adoption as possible. 

Traditional notions of due process 
require adequate prior notice of any 
rule that could cause irreparable harm 
to a person's educational or occupa-
tional prospects. Notification of
requirements after completing most of
one's educational program may be 
viewed as both unfair and inadequate, 
especially if the minimum competency 
test is designed to measure knowledge 
and skills not previously taught in 
the district's classrooms. 

3. Psychometric validity or 
reliability of minimum competency 
tests. All tests ought to meet
reasonable professional psychometric 
standards of validity and reliability. 
Simply stated, validity refers to 
whether or not a test measures what it 
purports to measure, and reliability 
refers to whether or not the test 
measures student performance 
accurately from one test adminis-
tration to another. The most widely
accepted professional test development
standards are the Standards for
Educational and Psychological Tests,
published by the American Psycholo-
gical Association. It is likely that
minimum competency tests will be
subjected to careful scrutiny against
such benchmarks as the Standards.
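
One common way to put a number on
reliability as defined above is to
give the same test twice and
correlate the two sets of scores. The
sketch below shows that calculation.
The scores are invented for
illustration, and the test-retest
correlation is only one of several
reliability measures; it is offered
as an example, not as the procedure
any particular program uses.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical scores for six students on two administrations of one test.
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 10]

reliability = pearson_r(first_administration, second_administration)
print(f"test-retest correlation: {reliability:.2f}")
```

A value near 1.0 means students kept
nearly the same standing from one
administration to the next; a value
near zero means the scores were not
stable.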

4. Match between the instruc-
tional program and the test. Most
persons would agree that fairness 
requires that a school's curriculum 
and instruction be matched to the 
competencies measured by a test. In 
other words, the test would be unfair 
if it attempted to measure what the 
school did not teach. This concept 
should be considered in terms of both 
curricular validity and instructional
validity. 

Curricular validity is a measure 
of how well test items match the 
objectives of the curriculum. An 
analysis of curricular validity would 
require comparison of the test 
objectives with the school's stated 
course objectives. This becomes 
important, for example, if the 
curriculum is not specifically
designed to teach functional 
competency and the use of a test 
covering functional competency is 
considered. It might be unfair to 
deny students their diplomas because 
they did not learn these functional 
competencies. In such a situation,
failure on the minimum competency test 
might indicate that the school did not 
offer an appropriate curriculum. 

A minimum competency test should 
also have what may be called
instructional validity. Even if the 
curricular objectives of the school 
correspond to those of the competency 
test, there might be a discrepancy 
between the stated objectives of the 
school and what is actually being 
taught in the classroom. Instruc- 
tional validity obviously does not 
require prior exposure of the student 
to the exact questions asked on the 
test, but it does require exposure to 
the kind of knowledge and skills that
would enable a student to answer the
test questions. 

It is important to note that 
content validity does not ensure 
either curricular or instructional 
validity. They are related, but 
distinguishable concepts. Content 
validity is a measure of how well test 
items represent the body of skills and 
knowledge that the test purports to 
measure but is not necessarily a 
measure of how well the test items 
represent either a school's curricular 
objectives or instruction. Instruc- 
tional validity should be the central 
concern because content and curricular 
validity mean very little if the test 
items are not representative of 
instruction actually received by the 
student. 
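
The difference between curricular and
instructional validity can be
sketched as two separate comparisons
against the same test. In the example
below, the objectives and the
coverage fraction are hypothetical
simplifications introduced for
illustration; an actual validity
review would rest on expert judgment
and classroom evidence, not on a
simple count.

```python
def coverage(tested_objectives, reference_objectives):
    """Fraction of tested objectives that also appear in the reference set."""
    if not tested_objectives:
        raise ValueError("the test must cover at least one objective")
    matched = tested_objectives & reference_objectives
    return len(matched) / len(tested_objectives)


# Hypothetical objectives, invented for this example.
tested = {
    "read a bus schedule",
    "compute sales tax",
    "write a business letter",
    "balance a checkbook",
}
stated_curriculum = {
    "compute sales tax",
    "write a business letter",
    "balance a checkbook",
    "interpret a poem",
}
taught_in_class = {
    "write a business letter",
    "interpret a poem",
}

curricular = coverage(tested, stated_curriculum)    # test vs. stated objectives
instructional = coverage(tested, taught_in_class)   # test vs. actual teaching

print(f"curricular match: {curricular:.0%}")
print(f"instructional match: {instructional:.0%}")
```

In this invented case the test lines
up well with the stated curriculum
but poorly with what was actually
taught, which is the situation the
paragraph above warns about.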

5. The degree to which remedial 
instruction may create or reinforce 
tracking. Most minimum competency
testing programs implicitly or explic- 
itly require remedial instruction for 
students found to be deficient in 
basic skills. In districts subject to 
findings of prior racial or linguistic 
discrimination as described above, one 
effect of minimum competency testing 
requirements may be to inappropriately 
channel or "track" disproportionate 
numbers of minority students into 
remedial programs on the basis of 
their test results. This could have 
the effect of "resegregating" students 
into remedial programs in direct
contradiction to prior orders to 
desegregate school systems. 



THE DEBRA P. v. TURLINGTON DECISION 

To date, the only major legal 
challenge to competency testing was 
mounted in Florida. In Debra P. v. 
Turlington, a group of black student
plaintiffs sued the state in Federal
District Court to have the state's
competency testing program ruled
unconstitutional. Plaintiffs 
challenged the test on each of the 
grounds mentioned above.

In July 1979, the court held that
Florida's competency testing program 
did not give all students adequate 
notice of the inclusion of the 
competency test as a graduation 
requirement, and that the competency
testing program carried forward the 
effects of prior racial discrimination 
in violation of the due process and 
equal protection clauses of the 
Fourteenth Amendment of the U.S. 
Constitution, Title VI of the Civil 
Rights Act of 1964, and the Equal 
Educational Opportunities Act of 
1974. As a remedy, the court enjoined 
Florida from using the test as a 
diploma requirement for four years, 
until the 1982-83 school year. The 
court did not, however, deny use of 
the competency test during this
four-year period for assessing the
effects of instruction. 

Although the court found 
psychometric deficiencies in Florida's 
test, it did not find these deficien- 
cies to be unconstitutional. The 
court did not address in any depth the 
issue of the correlation between the 
test and instructional program. 

Issue 7: Are tests being used to 
evaluate teachers in schools? 

Tests are being used to evaluate 
teachers in a variety of ways. But 
tests are never used as the sole 
criterion of teacher evaluation 
because of the complexity of the 
learning process. Since many factors 
influence learning, some under teacher 
control and some not, teacher 
evaluation must be done very carefully. 

The types of test scores that can 
play a role in teacher evaluation are 
the achievement test scores of 
students, test scores of licensing 
examinations, and the scores of tests 
used in the teacher selection and 
hiring processes. 

The evaluation of teachers by 
using the achievement test scores of
the students they teach is a very
delicate process. If a group of 
students who have previously shown 
patterns of growth in test scores do 
not grow over an extended period of 
time, and this phenomenon is apparent
in the test scores of all or nearly 
all students in the group with the 
same teacher, then those test scores 
can be combined with other information 
about the teacher as part of the 
teacher evaluation process. However, 
if test scores of students are to be 
used in this way, they must be used 
very carefully and with full awareness 
of the potential difficulties with 
this evaluation strategy. 

The first difficulty is that 
factors apart from the school 
experience can greatly influence 
student achievement. Since teachers 
have no control over many of these 
factors they cannot be held account- 
able. For example, characteristics 
such as the child's ability to learn, 
and the child's motivation, are not 
totally within the teacher's control. 
The student's home environment also 
exerts great influence on learning. 
In fact, some research suggests that 
some non-school factors may far
outweigh school factors in determin- 
ing achievement. When these factors 
begin to interact with the various 
characteristics of the school learning 
environment, it becomes difficult to 
sort out the component of learning
that is influenced by the teacher and 
the components that are influenced by 
non-school conditions. 

The second difficulty with using 
student test scores to evaluate 
teacher performance is the complexity 
of the desired end product. In 
school, teachers endeavor to help the 
child to gain knowledge and skills in 
many academic areas, some common to 
all students, some unique to an 
individual student. In addition, 
teachers attempt to develop values, 
attitudes and interpersonal skills 
that will benefit a student in 
society. Given all of these desired
traits along with the complexity and
uniqueness of each individual student, 
it becomes impossible to define the 
characteristics of the "desired" end 
product to evaluate. 

Even when it is possible to define 
the citizen we want our schools to 
produce, we have great difficulty 
reflecting many of the important 
characteristics in reliable and valid 
test scores. Though we can use tests
to document some of the basic 
achievement areas, the focus of these 
tests is very broad and general and 
may not reflect the important 
educational objectives in a given 
school district, building, or 
classroom. Furthermore, other desired 
outcomes, such as attitudes, values 
and interpersonal skills are 
inherently complex and not easily
measured in an objective way in school 
settings or otherwise. 

The third potential difficulty 
with using student test scores for 
teacher evaluation is that learning 
does not take place at a steady and 
predictable rate. Even if we could 
define and measure the end product of 
schooling and control most of the 
factors that influence that product, 
we could not assume that every child 
would gain new knowledge and skills at 
the same pace. Some would learn 
faster than others. Some would grow 
slowly then spurt ahead — all according 
to the nature of human development. 
This fact must be taken into account 
in evaluating teacher performance via 
student test scores. 

State licensing examinations are 
also used as a form of teacher 
evaluation. Though most states issue 
licenses on the basis of the 
completion of specified college 
courses or degrees, some also include 
an examination as part of the 
credentialing process. 

In the field of education, tests 
have been in use for decades for 
certifying teacher competence. The 
State of South Carolina, for example, 
has used the National Teacher Examinations 
(NTE) to certify teachers since 
1945. The Education Commission of the 
States⁵ has developed an excellent 
summary of the current status of such 
testing. 

The National Teacher Examinations, 
which are published and administered 
by the Educational Testing Service, 
include examinations covering academic 
preparation in professional education 
and general education (writing, 
science, math, social studies, 
literature) as well as academic 
preparation in 26 subject-field 
specializations. The tests typically 
focus on the recall of factual 
information with some use of higher 
order mental operations tests as well. 

In the fall of 1977, four states 
required or recommended use of NTE 
results for initial certification 
purposes. These states were 
Mississippi, North Carolina, South 
Carolina and West Virginia. Louisiana 
was added to this list in 1978. In 
addition to these five states, at 
least 23 states used the NTE for 
special purposes, ranging from 
obtaining statewide data for teacher 
education studies (Alabama) to 
validating credits earned at 
nonaccredited institutions (California, 
Delaware). In June 1978, the Florida 
Legislature passed a bill requiring, 
in part, a test of teaching competency 
and subject matter mastery for initial 
certification. Working steadily over 
a period of four to five years, the 
Georgia State Department of Education 
developed test instruments for a 
"Performance Based Teacher Certifi- 
cation" program, and first adminis- 
tered the test in November 1978. In 
early 1979, hearings were held in 
North Carolina on plans for a "Quality 
Assurance Program for Professional 
Personnel" in which testing for 



⁵Vlaanderen, R. "Trends in 
competency-based teacher 
certification." Denver, CO: 
Education Commission of the States, 
March 1980. 






teaching competencies and subject 
matter mastery plays a major role in 
the certification process. The 
program was adopted in the fall of 
1979. 

In 1979, several state legisla- 
tures introduced bills embodying the 
testing concept in teacher 
certification. In Arkansas, a bill 
was passed in record time, while 
similar bills in Colorado, Kansas, 
Arizona, Missouri and Vermont died in 
committee. Bills were introduced in 
Alabama, Iowa and Oklahoma in 1980 and 
again, in a special session, in 
Arizona. State Board action has 
mandated testing in Alabama and 
Tennessee. 

Test scores are also used, in some 
instances, when several teachers are 
being considered for a limited number 
of teaching positions. The employers 
may use a test as part of the 
selection process. In this case, all 
teachers may be certified, but another 
test might be used to determine 
knowledge of subject matter and/or 
ability to perform in a certain 
educational environment. As in the 
other instances, test scores should 
never be the only criteria considered 
in the selection process. But they 
can be a valuable selection aid when 
used carefully with other performance 
information. 



Annotated Bibliography 



Test Purposes and Users 

Anderson, B.L., Stiggins, R.J., and Hiscox, S.B. Guidelines for selecting 
basic skills and life skills tests. Portland, OR: Northwest 
Regional Educational Laboratory, 1980. 

This short guide designed for teachers and administrators 
discusses test purposes and characteristics to consider when 
selecting tests. Lists of currently available basic skills and 
life skills tests are provided along with the names and 
addresses of test publishers. 

Brown, F.G. Guidelines for test use: A commentary on the Standards for 
educational and psychological tests. Washington, D.C.: 
National Council on Measurement in Education, 1980. 

This book is designed for teachers, counselors, school 
psychologists, administrators, parents and others concerned with 
educational measurement. It is a nontechnical explanation of 
the Standards . 

Burrill, L.E. How a standardized achievement test is built, test service 
notebook 125 . New York, NY: The Psychological Corporation. 

The steps described are typical of the way tests are built by 
many major test publishers. Other short articles on related 
topics are available from The Psychological Corporation, New 
York, NY 10017. 

Feder, B. The complete guide to taking tests. Englewood Cliffs, NJ: 
Prentice-Hall, Inc., 1979. 

This book is written for test takers who want to take some of 
the mystery out of testing. 

Parents and testing. Washington, D.C.: National Education Association, 
1979. 

This guide provides parents with information on how they should 
and can be involved with schools' testing programs. It also 
gives the NEA position on student testing. 

Rebell, M.A., and Block, A.R. Competence assessment and the courts: 
An overview of the state of the law. Boston, MA: McBer, 1980. 

This study looks at the implication of legal cases on a wide 
variety of educational testing situations, including 
certification, IQ tests, ability tracking, and graduate school 
admissions tests. 






Teachers and testing. Washington, D.C.: National Education Association, 
1979. 



Teachers are provided with general information on how and why 
tests are used as well as their strengths and weaknesses. The 
NEA resolutions relating to testing issues are given. 



Achievement Test Score Decline 

Munday, L. Declining admissions test scores. ACT Research Report #71. 
Iowa City, IA: The American College Testing Program, 1976. 

Several indices of declining academic achievement are 
summarized. However, the principal focus is on declining ACT 
Assessment Program test scores. Correlates of score decline are 
identified and potential explanations are explored. 

Harnischfeger, A., and Wiley, D.E. Achievement test score decline: Do 
we need to worry? Monograph of CEMREL, Inc., 3120 59th Street, 
St. Louis, MO 63139, 1976. 

This 160-page monograph reviews several potential explanations 
for declining academic achievement test scores. Data are 
presented in association with the potential explanations 
presented and conclusions are drawn regarding each explanation. 
An excellent summary of conclusions is presented. 

College Entrance Examination Board. On further examination . New York, 
NY: 1977. 

This monograph reports the results of the deliberations of the 
CEEB Advisory Panel on the Scholastic Aptitude Test Score 
Decline. Potential explanations related to school and nonschool 
factors are examined and accepted or rejected as viable. 
Conclusions are presented regarding multiple causes. 



Truth in Testing 

Brown, R. Searching for the truth in "truth in testing" legislation: 
A background report. Denver, CO: Education Commission of the 
States, 1980. 

This is a readable summary of the background and current issues 
in truth in testing. It also summarizes relevant pending 
federal and state legislation. 

Brown, R. Searching for the truth about truth in testing. Compact, 
Winter, 1980, 7-11. 

This article is a much abbreviated summary of the issues 
presented in the background report listed above. 






Nairn, A. and Associates. The reign of ETS: The corporation makes up 
minds. Washington, D.C., 1980. 

The Nairn report on ETS was sponsored by Ralph Nader and offers 
a strong indictment of many of ETS' practices. 

Educational Testing Service. Test scores and family income . Princeton, 
NJ, February 1980. 

Educational Testing Service. Test use and validity. Princeton, NJ, 
February 1980. 

The two ETS reports were developed in response to the Nairn 
report. 



Cultural Bias 



Burrill, L.E. and Wilson, R. Fairness and the matter of bias: Test 
service notebook 36. New York, NY: The Psychological 
Corporation, 1980. 

This article succinctly covers major issues in facial bias, item 
bias and bias in selection and prediction. 

Burrill, L.E. Statistical evidence of potential bias in items and tests 
assessing current educational status . Paper presented at the 
Fourteenth Annual Southeastern Conference on Measurement in 
Education, 1975. 



This paper describes various definitions and interpretations of 
bias and provides a useful reference list. 

Shepard, L., Camilli, G., and Averill, M. Comparison of six procedures 
for detecting test item bias using internal and external ability 
criteria. A paper presented to the National Council on 
Measurement in Education Annual Meeting, Boston, 1980. 

This paper not only provides a thorough comparison of procedures 
for detecting test item bias, but also contains an extensive 
reference list to the literature on test item bias. 



IQ Testing 

Larry P. v. Riles , No. C71-2270 RFP, N.D. Cal. Decision 10/16/79. 

Readers who are interested in pursuing the issues raised in the 
Larry P. decision are urged to obtain a transcript of the 
decision and read it in its entirety. The decision is readable, 
to the point, and appropriate for a lay reader. 



Parents in Action on Special Education v. Hannon. No. 74C3586, N.D. 
Ill., Decision 

This transcript of the Parents in Action case provides a 
detailed item by item analysis of the IQ tests in question. 

Notes on Larry P. Footnotes (Newsletter of the Law and Education 
Center, Education Commission of the States, Denver, CO) Vol. 1, 
No. 2, Spring 1980. 

This newsletter presents a short, readable analysis of the Larry 
P. v. Riles case. 



Minimum Competency Testing 

Bunda, M.A., and Sanders, J.R. (Eds.) Practices and problems in 
competency based measurement. Washington, D.C.: National 
Council on Measurement in Education, 1979. 

This 144 page book provides articles on the key issues in 
competency based testing. 

Debra P. v. Turlington. Footnotes (Newsletter of the Law and 
Education Center, Education Commission of the States, Denver, 
CO) Vol. 1, No. 1, November 1979. 

This newsletter provides a short readable review of the key 
issues in the Debra P. v. Turlington case. 

Gorth, W.P. and Perkins, M.R. A study of minimum competency testing 
programs: Final summary and analysis report . Amherst, MA: 
National Evaluation Systems, 1979. 

This report summarizes the current status of the implementation 
of minimum competency testing across the country. 

McClung, M.S. Competency testing programs: Legal and educational issues. 
Fordham Law Review, 1979, 47, 651-711. 

This article is an exhaustive review of legal issues which 
incorporates potential implications of the Debra P. v. 
Turlington decision. 

Shoemaker, J.S. Minimum competency testing: Implications for 
instruction. Washington, D.C.: National Institute of 
Education, January 1979. 

This paper presents a discussion of design considerations in the 
development of minimum competency testing programs that will 
maximize the utility of the program for instructional uses. 







Rosewater, A. Minimum competency testing programs and handicapped 
students: Perspectives on policy and practice. Washington, 
D.C.: George Washington University Institute for Educational 
Leadership, 1979. 

This paper presents a review of policy and practical problems 
involved in implementing minimum competency testing programs for 
the handicapped. 



Teacher Testing and Evaluation 



The Psychological Corporation. Summaries of court decisions on employment 
testing, 1968-1977 . New York, NY, 1978. 

This book summarizes court decisions on employment testing in 
both the private and public sector; it is not limited to 
educational personnel. 

Vlaanderen, R. Trends in competency based teacher certification. 
Denver, CO: Education Commission of the States, March 1980. 

This paper presents a summary of the current status of teacher 
competency testing. 







Appendix A 

A Glossary of Measurement Terms 



The following glossary is used with the permission of 
The Psychological Corporation, New York, NY 10017. 

Similar glossaries may be obtained from other major 
test publishers. 




Test Service Notebook 13 



A Glossary of Measurement Terms 

BLYTHE C. MITCHELL, Consultant, Test Department 



This glossary of terms used in educational and psychological 
measurement is primarily for persons with limited training 
in measurement, rather than for the specialist. The terms 
defined are the more common or basic ones such as occur in 
test manuals and educational journals. In the definitions, 
certain technicalities and niceties of usage have been sacrificed 
for the sake of brevity and, it is hoped, clarity. 

The definitions are based on the usage of the various terms 
as given in the current textbooks in educational and psychological 
measurement and statistics, and in certain specialized 
dictionaries. Where there is not complete uniformity among 
writers in the measurement field with respect to the meaning 
of a term, either these variations are noted or the definition 
offered is the one that the writer judges to represent the 
"best" usage. 

academic aptitude. The combination of native and acquired 
abilities that are needed for school learning; likelihood of 
success in mastering academic work, as estimated from measures 
of the necessary abilities. (Also called scholastic aptitude, 
school learning ability, academic potential.) 

achievement test. A test that measures the extent to which a 
person has "achieved" something, acquired certain information, 
or mastered certain skills, usually as a result of planned 
instruction or training. 

age norms. Originally, values representing typical or average 
performance for persons of various age groups; most current 
usage refers to sets of complete score interpretive data for 
appropriate successive age groups. Such norms are generally 
used in the interpretation of mental ability test scores. 

alternate-form reliability. The closeness of correspondence, 
or correlation, between results on alternate (i.e., equivalent or 
parallel) forms of a test; thus, a measure of the extent to which 
the two forms are consistent or reliable in measuring whatever 
they do measure. The time interval between the two testings 
must be relatively short so that the examinees themselves 
are unchanged in the ability being measured. See reliability, 
reliability coefficient. 

anecdotal record. A written description of an incident in an 
individual's behavior that is reported objectively and is 
considered significant for the understanding of the individual. 



aptitude. A combination of abilities and other characteristics, 
whether native or acquired, that are indicative of an individual's 
ability to learn or to develop proficiency in some particular 
area if appropriate education or training is provided. 
Aptitude tests include those of general academic ability (commonly 
called mental ability or intelligence tests); those of 
special abilities, such as verbal, numerical, mechanical, or 
musical; tests assessing "readiness" for learning; and prognostic 
tests, which measure both ability and previous learning, 
and are used to predict future performance, usually in a 
specific field, such as foreign language, shorthand, or nursing. 

Some would define "aptitude" in a more comprehensive 
sense. Thus, "musical aptitude" would refer to the combination 
not only of physical and mental characteristics but also 
of motivational factors, interest, and conceivably other 
characteristics, which are conducive to acquiring proficiency in 
the musical field. 

arithmetic mean. A kind of average usually referred to as 
the mean. It is obtained by dividing the sum of a set of scores 
by their number. 

average. A general term applied to the various measures of 
central tendency. The three most widely used averages are 
the arithmetic mean (mean), the median, and the mode. When 
the term "average" is used without designation as to type, 
the most likely assumption is that it is the arithmetic mean. 
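
As a small illustration of the three averages named above, the following Python sketch computes each one for a made-up set of scores. The scores are hypothetical and serve only to show how the three measures can differ for the same group.

```python
from statistics import mean, median, multimode

# Hypothetical test scores for seven students (illustration only).
scores = [70, 75, 75, 80, 85, 90, 99]

# Arithmetic mean: the sum of the scores divided by their number.
arithmetic_mean = mean(scores)

# Median: the middle score when the scores are ranked.
middle_score = median(scores)

# Mode: the score (or scores) occurring most often.
most_frequent = multimode(scores)

print(arithmetic_mean, middle_score, most_frequent)
```

Here the mean is pulled upward by the single high score of 99, while the median and mode are not, which is one reason the type of average should be stated.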

battery. A group of several tests standardized on the same 
sample population so that results on the several tests are com- 
parable. (Sometimes loosely applied to any group of tests 
administered together, even though not standardized on the 
same subjects.) The most common test batteries are those of 
school achievement, which include subtests in the separate 
learning areas. 

bivariate chart (bivariate distribution). A diagram in which a 
tally mark is made to show the scores of one individual on 
two variables. The intersection of lines determined by the 
horizontal and vertical scales form cells in which the tallies 
are placed. Such a plot provides frequencies for the two 
distributions, and portrays the relation between the two variables 
as a basis for computation of the product-moment correlation 
coefficient. 



issued by The Psychological Corporation 




ceiling. The upper limit of ability that can be measured by a 
test. When an individual makes a score which is at or near the 
highest possible score, it is said that the test has too low a 
"ceiling" for him; he should be given a higher level of the test. 

central tendency. A measure of central tendency provides a 
single most typical score as representative of a group of scores; 
the "trend" of a group of measures as indicated by some type 
of average, usually the mean or the median. 

coefficient of correlation. A measure of the degree of 
relationship or "going-togetherness" between two sets of measures 
for the same group of individuals. The correlation coefficient 
most frequently used in test development and educational 
research is that known as the Pearson or product-moment 
r. Unless otherwise specified, "correlation" usually refers 
to this coefficient, but rank, biserial, tetrachoric, and other 
methods are used in special situations. Correlation coefficients 
range from .00, denoting a complete absence of relationship, 
to +1.00, and to -1.00, indicating perfect positive or perfect 
negative correspondence, respectively. See correlation. 
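
The Pearson product-moment r described above can be sketched in a few lines of Python. The function name `pearson_r` and the sample scores are assumptions made for illustration; the computation is the usual one, the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two sets of scores
    obtained from the same group of individuals."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one set has no variation")
    return sxy / sqrt(sxx * syy)

# Hypothetical scores: the second test is exactly double the first,
# so the correspondence is perfect and positive.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

Reversing the order of one list in this example would give perfect negative correspondence instead.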

composite score. A score which combines several scores, 
usually by addition; often different weights are applied to the 
contributing scores to increase or decrease their importance 
in the composite. Most commonly, such scores are used for 
predictive purposes and the several weights are derived through 
multiple regression procedures. 

concurrent validity. See validity (2). 

construct validity. See validity (3). 

content validity. See validity (1). 

correction for guessing (correction for chance). A reduction in 
score for wrong answers, sometimes applied in scoring true-false 
or multiple-choice questions. Such scoring formulas 
(R - W for tests with 2-option response, R - 1/2 W for 3 
options, R - 1/3 W for 4, etc.) are intended to discourage 
guessing and to yield more accurate rankings of examinees in 
terms of their true knowledge. They are used much less today 
than in the early days of testing. 
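
The scoring formulas listed above follow one pattern: the number of wrong answers is divided by one less than the number of options before being subtracted from the number right. A minimal Python sketch follows; the function name `corrected_score` and the sample counts are assumptions for illustration.

```python
def corrected_score(rights, wrongs, options):
    """Score corrected for guessing: R - W / (options - 1).

    rights  -- number of items answered correctly (R)
    wrongs  -- number of items answered incorrectly (W); omitted items
               are not counted as wrong
    options -- number of answer choices per item (2 for true-false)
    """
    if options < 2:
        raise ValueError("an item needs at least two options")
    return rights - wrongs / (options - 1)

# Hypothetical examinee with 40 right and 10 wrong:
print(corrected_score(40, 10, 2))  # true-false: R - W
print(corrected_score(40, 10, 3))  # 3 options: R - 1/2 W
```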

correlation. Relationship or "going-togetherness" between two 
sets of scores or measures; tendency of one score to vary 
concomitantly with the other, as the tendency of students of high 
IQ to be above average in reading ability. The existence of a 
strong relationship (i.e., a high correlation) between two 
variables does not necessarily indicate that one has any causal 
influence on the other. See coefficient of correlation. 

criterion. A standard by which a test may be judged or eval- 
uated; a set of scores, ratings, etc., that a test is designed to 
measure, to predict, or to correlate with. See validity. 

criterion-referenced (content-referenced) test. Terms often used 
to describe tests designed to provide information on the specific 
knowledge or skills possessed by a student. Such tests 
usually cover relatively small units of content and are closely 
related to instruction. Their scores have meaning in terms of 
what the student knows or can do, rather than in their relation 
to the scores made by some external reference group. 



criterion-related validity. See validity (2). 

culture-fair test. So-called culture-fair tests attempt to provide 
an equal opportunity for success by persons of all cultures and 
life experiences. Their content must therefore be limited to 
that which is equally common to all cultures, or to material 
that is entirely unfamiliar and novel for all persons whatever 
their cultural background. See culture-free test. 

culture-free test. A test that is free of the impact of all cultural 
experiences; therefore, a measure reflecting only hereditary 
abilities. Since culture permeates all of man's environmental 
contacts, the construction of such a test would seem to be an 
impossibility. Cultural "bias" is not eliminated by the use of 
non-language or so-called performance tests, although it may 
be reduced in some instances. In terms of most of the purposes 
for which tests are used, the validity (value) of a "culture-free" 
test is questioned; a test designed to be equally applicable 
to all cultures may be of little or no practical value in any. 

curricular validity. See validity (2). 

decile. Any one of the nine points (scores) that divide a dis- 
tribution into ten parts, each containing one-tenth of all the 
scores or cases; every tenth percentile. The first decile is the 
10th percentile, the eighth decile the 80th percentile, etc. 
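
As an illustration, the nine decile points of a score distribution can be computed with Python's standard library. The scores below are hypothetical, and the "inclusive" interpolation method is only one of several conventions in use; other conventions give slightly different cut points for the same data.

```python
from statistics import quantiles

# Hypothetical distribution: one examinee at each score from 1 to 100.
scores = list(range(1, 101))

# n=10 asks for the nine points that divide the distribution into ten parts.
deciles = quantiles(scores, n=10, method="inclusive")

for rank, point in enumerate(deciles, start=1):
    print(f"decile {rank} (percentile {rank * 10}): {point}")
```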

deviation. The amount by which a score differs from some 
reference value, such as the mean, the norm, or the score on 
some other test. 

deviation IQ (DIQ). An age-based index of general mental 
ability. It is based upon the difference or deviation between a 
person's score and the typical or average score for persons of 
his chronological age. Deviation IQs from most current scho- 
lastic aptitude measures are standard scores with a mean of 
100 and a standard deviation of 16 for each defined age group. 
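
A deviation IQ of this kind can be sketched as a standard score: the person's distance from the age-group average, in standard deviation units, rescaled to a mean of 100 and a standard deviation of 16. The sketch below assumes a simple linear conversion; the function name and the raw-score figures are hypothetical, and published tests supply their own conversion tables.

```python
def deviation_iq(raw_score, age_group_mean, age_group_sd, scale_sd=16):
    """Convert a raw score to a deviation IQ (mean 100, S.D. 16 by default),
    assuming a linear conversion within the person's age group."""
    if age_group_sd <= 0:
        raise ValueError("standard deviation must be positive")
    # z: how many standard deviations the score lies from the age-group mean.
    z = (raw_score - age_group_mean) / age_group_sd
    return 100 + scale_sd * z

# Hypothetical age group with mean raw score 50 and S.D. 10:
print(deviation_iq(60, 50, 10))  # one S.D. above the mean
print(deviation_iq(50, 50, 10))  # exactly at the mean
```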

diagnostic test. A test used to "diagnose" or analyze; that is, 
to locate an individual's specific areas of weakness or strength, 
to determine the nature of his weaknesses or deficiencies, and, 
wherever possible, to suggest their cause. Such a test yields 
measures of the components or subparts of some larger body 
of information or skill. Diagnostic achievement tests are most 
commonly prepared for the skill subjects. 

difficulty value. An index which indicates the percent of some 
specified group, such as students of a given age or grade, who 
answer a test item correctly. 

discriminating power. The ability of a test item to differentiate 
between persons possessing much or little of some trait. 

discrimination index. An index which indicates the discrimi- 
nating power of a test item. The most commonly used index 
is derived from the number passing the item in the highest 27 
percent of the group (on total score) and the number passing 
in the lowest 27 percent. 
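
The difficulty value and a discrimination index of the upper-lower 27 percent type can be sketched as follows. The glossary does not give the exact formula, so the version here (the difference between the numbers passing in the two groups, divided by the group size) is an assumption, as are the function names and the sample data.

```python
def difficulty_value(item_correct):
    """Percent of the group answering the item correctly (1 = right, 0 = wrong)."""
    return 100 * sum(item_correct) / len(item_correct)

def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Upper-lower index: (passes in top group - passes in bottom group)
    divided by the group size, with groups formed on total score."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Rank examinees by total test score, lowest first.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:group_size]
    high_group = order[-group_size:]
    passed_high = sum(item_correct[i] for i in high_group)
    passed_low = sum(item_correct[i] for i in low_group)
    return (passed_high - passed_low) / group_size

# Hypothetical data for ten examinees: total scores and one item's results.
totals = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

print(difficulty_value(item))              # percent passing the item
print(discrimination_index(totals, item))  # ranges from -1 to +1
```

A value near +1 means the item is passed mainly by high scorers; a value near zero or below suggests the item does not separate high from low scorers.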

distractor. Any incorrect choice (option) in a test item. 

distribution (frequency distribution). A tabulation of the scores 
(or other attributes) of a group of individuals to show the 
number (frequency) of each score, or of those within the 
range of each interval. 






equivalent form. Any of two or more forms of a test that are 
closely parallel with respect to the nature of the content and 
the number and difficulty of the items included, and that will 
yield very similar average scores and measures of variability 
for a given group. (Also referred to as alternate, comparable, 
or parallel form.) 

error of measurement. See standard error of measurement. 

expectancy table ("expected" achievement). A term with two 
common usages, related but with some difference: 

(1) A table or other device for showing the relation between 
scores on a predictive test and some related outcome. 
The outcome, or criterion status, for individuals at each level 
of predictive score may be expressed as (a) an average on 
the outcome variable, (b) the percent of cases at successive 
levels, or (c) the probability of reaching given performance 
levels. Such tables are commonly used in making predictions 
of educational or job success. 

(2) A table or chart providing for an interpretation of a 
student's obtained score on an achievement test with the score 
which would be "expected" for those at his grade level and 
with his level of scholastic aptitude. Such "expectancies" are 
based upon actual data from administration of the specified 
achievement and scholastic aptitude tests to the same student 
population. The term "anticipated" is also used to denote 
achievement as differentiated by level of "intellectual status." 

extrapolation. In general, any process of estimating values of 
a variable beyond the range of available data. As applied to 
test norms, the process of extending a norm line into grade or 
age levels not tested in the standardization program, in order 
to permit interpretation of extreme scores. Since this extension 
is usually done graphically, considerable judgment is involved. 
Extrapolated values are thus to some extent arbitrary; for this 
and other reasons, they have limited meaning. 

f. A symbol denoting the frequency of a given score or of the 
scores within an interval grouping. 

face validity. See validity (1). 

factor. In mental measurement, a hypothetical trait, ability, 
or component of ability that underlies and influences performance 
on two or more tests and hence causes scores on the tests 
to be correlated. The term "factor" strictly refers to a theoretical 
variable, derived by a process of factor analysis from 
a table of intercorrelations among tests. However, it is also 
used to denote the psychological interpretation given to the 
variable, i.e., the mental trait assumed to be represented by the 
variable, as verbal ability, numerical ability, etc. 

factor analysis. Any of several methods of analyzing the 
intercorrelations among a set of variables such as test scores. 
Factor analysis attempts to account for the interrelationships 
in terms of some underlying "factors," preferably fewer in 
number than the original variables, and it reveals how much 
of the variation in each of the original measures arises from, 
or is associated with, each of the hypothetical factors. Factor 
analysis has contributed to an understanding of the organization 
or components of intelligence, aptitudes, and personality; 
and it has pointed the way to the development of "purer" tests 
of the several components. 




forced-choice item. Broadly, any multiple-choice item in 
which the examinee is required to select one or more of the 
given choices. The term is most often used to denote a special 
type of multiple-choice item employed in personality tests in 
which the options are (1) of equal "preference value," i.e., 
chosen equally often by a typical group, and are (2) such that 
one of the options discriminates between persons high and low 
on the factor that this option measures, while the other options 
measure other factors. Thus, in the Gordon Personal Profile, 
each of four options represents one of the four personality 
traits measured by the Profile, and the examinee must select 
both the option which describes him most and the one which 
describes him least. 

frequency distribution. See distribution. 

g. Denotes general intellectual ability; one dimensional measure 
of "mind," as described by the British psychologist 
Spearman. A test of "g" serves as a general-purpose test of 
mental ability. 

grade equivalent (GE). The grade level for which a given 
score is the real or estimated average. Grade-equivalent 
interpretation, most appropriate for elementary level achievement 
tests, expresses obtained scores in terms of grade and month 
of grade, assuming a 10-month school year (e.g., 5.7). Since 
such tests are usually standardized at only one (or two) 
point(s) within each grade, grade equivalents between points 
for which there are data-based scores must be "estimated" by 
interpolation. See extrapolation, interpolation. 
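
The estimation step can be illustrated with straight-line interpolation between two standardization points. This is a sketch under stated assumptions: the function name, the norm points, and the choice of linear interpolation are all hypothetical, and publishers may use smoothed norm lines instead.

```python
def grade_equivalent(score, norm_points):
    """Estimate a grade equivalent by linear interpolation.

    norm_points -- (average_score, grade_placement) pairs obtained in
                   standardization, e.g. (20, 4.7) means a score of 20 was
                   the average at grade 4, seventh month.
    """
    points = sorted(norm_points)
    for (s0, g0), (s1, g1) in zip(points, points[1:]):
        if s1 > s0 and s0 <= score <= s1:
            # Place the score proportionally between the two tested points.
            return round(g0 + (g1 - g0) * (score - s0) / (s1 - s0), 1)
    # Scores outside the tested range would need extrapolation instead.
    raise ValueError("score lies outside the standardization points")

# Hypothetical norms: average scores of 20 and 30 at grades 4.7 and 5.7.
print(grade_equivalent(25, [(20, 4.7), (30, 5.7)]))
```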

grade norms. Norms based upon the performance of pupils of 
given grade placement. See grade equivalent, norms, per- 
centile RANK, STANINE. 

group test. A test that may be administered to a number of 
individuals at the same time by one examiner. 

individual test. A test that can be administered to only one 
person at a time, because of the nature of the test and/or the 
maturity level of the examinees. 

intelligence quotient (IQ). Originally, an index of brightness 
expressed as the ratio of a person's mental age to his chrono- 
logical age, MA/CA, multiplied by 100 to eliminate the 
decimal. (More precisely — and particularly for adult ages, at 
which mental growth is assumed to have ceased — the ratio of 
mental age to the mental age normal for chronological age.) 
This quotient IQ has been gradually replaced by the deviation 
IQ concept. 
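
The original quotient IQ is simple arithmetic, as the following Python sketch shows. The function name and the ages are hypothetical; ages may be given in years or months as long as both use the same unit.

```python
def ratio_iq(mental_age, chronological_age):
    """Quotient IQ: mental age divided by chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100 * mental_age / chronological_age

# A hypothetical 10-year-old performing at the level typical of age 12:
print(ratio_iq(12, 10))
```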

It is sometimes desired to give additional meaning to IQs 
by the use of verbal descriptions for the ranges in which they 
fall. Since the IQ scale is a continuous one, there can be no 
inflexible line of demarcation between such successive cate- 
gory labels as very superior, superior, above average, average, 
below average, etc.; any verbal classification system is there- 
fore an arbitrary one. There appears to be, however, rather 
common use of the term average or normal to describe IQs 
from 90-109 inclusive. 

An IQ is more definitely "interpreted" by noting the normal 
percent of IQs within a range which includes the IQ, and/or 
by indicating its percentile rank or stanine in the total 
national norming sample. Column 2 of Table 1 shows the normal 
distribution of IQs for M = 100 and S.D. = 16, showing 
percentages within successive 10-point intervals. (For IQs 
whose S.D. is greater than 16, the percentages for the extreme 
IQ ranges will be larger, and those for IQs near the mean will 
be smaller, than those shown in the table.) Table 1 indicates 
that 47 percent, approximately one-half of "all" persons, have 
IQs in the 20-point range of 90 through 109; an IQ of 140 or 
above would be considered as extremely high, since fewer 
than one percent (0.6) of the total population reach this level, 
and fewer than one percent have IQs below 60. From the 
cumulative percents given in Column 3, it is noted that 3.1 
percent have IQs below 70, usually considered the mentally 
retarded category. This column may be used to indicate the 
percentile rank (PR) of certain IQs. Thus an IQ of 119 has a 
PR of 89, since 89.4 percent of IQs are 119 or below; an IQ of 
79 has a PR of 10.6, or 11. See deviation IQ, mental age. 

Table 1. Normal Distribution of IQs with Mean of 100 and 
Standard Deviation of 16 

     (1)              (2)            (3) 
     IQ           Percent of     Cumulative 
    Range          Persons         Percent 

140 and above        0.6           100.0 
130-139              2.5            99.4 
120-129              7.5            96.9 
110-119             16.0            89.4 
100-109             23.4            73.4 
 90- 99             23.4            50.0 
 80- 89             16.0            26.6 
 70- 79              7.5            10.6 
 60- 69              2.5             3.1 
Below 60             0.6             0.6 

Total              100.0 
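
The cumulative percents in Table 1 can be approximated from the normal curve. The Python sketch below assumes the interval limits are treated as continuous boundaries at 60, 70, and so on up to 140; under that assumption the computed values agree with Column 3 to within about a tenth of a percentage point, the small differences being due to rounding.

```python
from statistics import NormalDist

# Normal distribution of IQs with mean 100 and standard deviation 16.
iq_distribution = NormalDist(mu=100, sigma=16)

# Upper boundaries of the successive 10-point intervals (an assumption
# about how the table's limits were treated).
boundaries = [60, 70, 80, 90, 100, 110, 120, 130, 140]

# Cumulative percent of persons with IQs below each boundary.
cumulative = {b: 100 * iq_distribution.cdf(b) for b in boundaries}

for b in boundaries:
    print(f"IQ below {b}: {cumulative[b]:5.1f} percent")
```

The percent within any interval is the difference between two successive cumulative values; for example, the value below 110 minus the value below 100 gives the percent of persons in the 100-109 range.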





inventory test. An achievement test that attempts to cover 
rather thoroughly some relatively small unit of specific 
instruction or training. An inventory test, as the name suggests, 
is in the nature of a "stock-taking" of an individual's 
knowledge or skill, and is often administered prior to instruction. 

item. A single question or exercise in a test.

item analysis. The process of evaluating single test items in 
respect to certain characteristics. It usually involves determin- 
ing the difficulty value and the discriminating power of the 
item, and often its correlation with some external criterion. 

Kuder-Richardson formula(s). Formulas for estimating the 
reliability of a test that are based on inter-item consistency
and require only a single administration of the test. The one 
most used, formula 20, requires information based on the 
number of items in the test, the standard deviation of the total 
score, and the proportion of examinees passing each item. The 
Kuder-Richardson formulas are not appropriate for use with 
speeded tests. 
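
The three ingredients named for formula 20 (number of items, variability of the total score, and the proportion passing each item) can be combined as in the sketch below. It assumes the commonly cited form of the formula, KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), with q = 1 - p and the population variance of total scores; the function name and the response data are hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """Estimate reliability with Kuder-Richardson formula 20.

    responses[e][i] is 1 if examinee e passed item i, else 0.
    Assumes at least two items and some variation in total scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])

    # Variance of the total scores (the square of their standard deviation).
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)

    # Sum over items of p * q, where p is the proportion passing the item.
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_examinees
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Hypothetical data: five examinees, four items.
EXAMPLE_RESPONSES = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
```

Only one administration of the test is needed, which is the practical appeal noted in the entry.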

mastery test. A test designed to determine whether a pupil has 
mastered a given unit of instruction or a single knowledge or 
skill; a test giving information on what a pupil knows, rather 
than on how his performance relates to that of some
norm-reference group. Such tests are used in computer-assisted
instruction, where their results are referred to as content- or
criterion-referenced information.

mean (M). See arithmetic mean. 

median (Md). The middle score in a distribution or set of 
ranked scores; the point (score) that divides the group into 
two equal parts; the 50th percentile. Half of the scores are 
below the median and half above it, except when the median 
itself is one of the obtained scores. 



internal consistency. Degree of relationship among the items 
of a test; consistency in content sampling. See split-half
reliability.

interpolation. In general, any process of estimating inter- 
mediate values between two known points. As applied to test 
norms, it refers to the procedure used in assigning interpretive 
values (e.g., grade equivalents) to scores between the succes- 
sive average scores actually obtained in the standardization 
process. Also, in reading norm tables it 
is necessary at times to interpolate to 
obtain a norm value for a score between 
two scores given in the table; e.g., in the 
table shown here, a percentile rank of 
83 (from 81 + 1/3 of 6) would be assigned,
by interpolation, to a score of
46; a score of 50 would correspond to a percentile rank of 94
(obtained as 87 + 2/3 of 10).
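
The worked example can be reproduced with a few lines of code. This is a minimal sketch of straight-line interpolation between adjacent table entries; the function name is illustrative, and the table values are the three score and percentile rank pairs given for this entry (51 and 97, 48 and 87, 45 and 81).

```python
def interpolate_percentile_rank(score, norm_table):
    """Linearly interpolate a percentile rank for `score`.

    norm_table is a list of (score, percentile_rank) pairs in any order.
    """
    points = sorted(norm_table)
    for (s_lo, pr_lo), (s_hi, pr_hi) in zip(points, points[1:]):
        if s_lo <= score <= s_hi:
            # Fraction of the way from the lower to the upper tabled score.
            fraction = (score - s_lo) / (s_hi - s_lo)
            return pr_lo + fraction * (pr_hi - pr_lo)
    raise ValueError("score lies outside the range of the norm table")


NORM_TABLE = [(51, 97), (48, 87), (45, 81)]
```

Rounding the result to the nearest whole number gives the percentile ranks quoted in the entry for scores of 46 and 50.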

inventory. A questionnaire or check list, usually in the form 
of a self-report, designed to elicit non-intellective information 
about an individual. Not tests in the usual sense, inventories 
are most often concerned with personality traits, interests, 
attitudes, problems, motivation, etc. See personality test. 





Score    Percentile Rank

51       97
48       87
45       81



mental age (MA). The age for which a given score on a mental
ability test is average or normal. If the average score made
by an unselected group of children 6 years, 10 months of age
is 55, then a child making a score of 55 is said to have a mental
age of 6-10. Since the mental age unit shrinks with increasing
(chronological) age, MAs do not have a uniform
interpretation throughout all ages. They are therefore most
appropriately used at the early age levels where mental growth
is relatively rapid.

modal-age norms. Achievement test norms that are based on 
the performance of pupils of normal age for their respective 
grades. Norms derived from such age-restricted groups are
free from the distorting influence of the scores of underage and
overage pupils.

mode. The score or value that occurs most frequently in a 
distribution. 

multiple-choice item. A test item in which the examinee's task
is to choose the correct or best answer from several given 
answers or options. 

N. The symbol commonly used to represent the number of 
cases in a group. 






non-language test. See non-verbal test.

non-verbal test. A test that does not require the use of words
in the item or in the response to it. (Oral directions may be
included in the formulation of the task.) A test cannot, however,
be classified as non-verbal simply because it does not
require reading on the part of the examinee. The use of
non-verbal tasks cannot completely eliminate the effect of culture.

norm line. A smooth curve drawn to best fit (1) the plotted
mean or median scores of successive age or grade groups, or
(2) the successive percentile points for a single group.

normal distribution. A distribution of scores or measures that
in graphic form has a distinctive bell-shaped appearance.
Figures 1 and 2 show graphs of such a distribution, known as
a normal, normal probability, or Gaussian curve. (Difference in
shape is due to the different variability of the two distributions.)
In such a normal distribution, scores or measures are distributed
symmetrically about the mean, with as many cases up to various
distances above the mean as down to equal distances below it.
Cases are concentrated near the mean and decrease in frequency,
according to a precise mathematical equation, the
farther one departs from the mean. Mean and median are
identical. The assumption that mental and psychological
characteristics are distributed normally has been very useful in
test development work.

norms. Statistics that supply a frame of reference by which
meaning may be given to obtained test scores. Norms are based
upon the actual performance of pupils of various grades or
ages in the standardization group for the test. Since they
represent average or typical performance, they should not be
regarded as standards or as universally desirable levels of
attainment. The most common types of norms are deviation IQ,
percentile rank, grade equivalent, and stanine. Reference groups
are usually those of specified age or grade.

objective test. A test made up of items for which correct
responses may be set up in advance; scores are unaffected by the
opinion or judgment of the scorer. Objective keys provide for
scoring by clerks or by machine. Such a test is contrasted with
a "subjective" test, such as the usual essay examination, to
which different persons may assign different scores, ratings,
or grades.

omnibus test. A test (1) in which items measuring a variety of
mental operations are all combined into a single sequence
rather than being grouped together by type of operation, and
(2) from which only a single score is derived, rather than
separate scores for each operation or function. Omnibus tests
make for simplicity of administration, since one set of directions
and one overall time limit usually suffice. The Elementary,
Intermediate, and Advanced tests in the Otis-Lennon
Mental Ability Test series are omnibus-type tests, as contrasted
with the Kuhlmann-Anderson Measure of Academic Potential,
in which the items measuring similar operations occur
together, each with its own set of directions. In a spiral-omnibus
test, the easiest items of each type are presented first, followed
by the same succession of item types at a higher difficulty
level, and so on in a rising spiral.



percentile (P). A point (score) in a distribution at or below
which fall the percent of cases indicated by the percentile. Thus
a score coinciding with the 35th percentile (P35) is regarded
as equaling or surpassing that of 35 percent of the persons in
the group, and such that 65 percent of the performances exceed
this score. "Percentile" has nothing to do with the percent
of correct answers an examinee makes on a test.

percentile band. An interpretation of a test score which takes 
account of the measurement error that is involved. The range 
of such bands, most useful in portraying significant differences 
in battery profiles, is usually from one standard error of 
measurement below the obtained score to one standard error 
of measurement above it. 

percentile rank (PR). The expression of an obtained test score 
in terms of its position within a group of 100 scores; the per- 
centile rank of a score is the percent of scores equal to or 
lower than the given score in its own or in some external 
reference group. 
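
The definition translates directly into a count. The sketch below follows the "equal to or lower than" wording of this entry; other publishers use slightly different conventions (for example, counting half of the tied scores), so this is one reading rather than a universal rule. The function name and the scores are hypothetical.

```python
def percentile_rank(score, group_scores):
    """Percent of scores in the reference group equal to or lower than `score`."""
    at_or_below = sum(1 for s in group_scores if s <= score)
    return 100 * at_or_below / len(group_scores)


# Hypothetical reference group of ten raw scores.
REFERENCE_GROUP = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```

Note that the result depends entirely on which reference group is supplied, which is why the entry distinguishes a score's own group from an external one.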

performance test. A test involving some motor or manual re- 
sponse on the examinee's part, generally a manipulation of 
concrete equipment or materials. Usually not a paper-and-pencil
test.

(1) A "performance" test of mental ability is one in which
the role of language is excluded or minimized, and ability is
assessed by what the examinee does rather than by what he
says (or writes). Mazes, form boards, picture completion, and
other types of items may be used. Examples include certain
Stanford-Binet tasks, the Performance Scale of Wechsler
Intelligence Scale for Children, Arthur Point Scale of Performance
Tests, Raven's Progressive Matrices.

(2) "Performance" tests include measures of mechanical
or manipulative ability where the task itself coincides with
the objective of the measurement, as in the Bennett Hand-Tool
Dexterity Test.

(3) The term "performance" is also used to denote a test 
that is actually a work-sample; in this sense it may include 
paper-and-pencil tests, as, for example, a test in bookkeeping, 
in shorthand, or in proofreading, where no materials other than 
paper and pencil may be required, and where the test response 
is identical with the behavior about which information is 
desired. SRA Typing Skills is such a test. 

The use of the term "performance" to describe a type of 
test is not very precise and there are certain "gray areas." 
Perhaps one should think of "performance" tests as those on 
which the obtained differences among individuals may not be 
ascribed to differences in ability to use verbal symbols. 

personality test. A test intended to measure one or more of the 
non-intellective aspects of an individual's mental or psy- 
chological make-up; an instrument designed to obtain infor- 
mation on the affective characteristics of an individual—emo- 
tional, motivational, attitudinal, etc. — as distinguished from 
his abilities. Personality tests include (1) the so-called personality
and adjustment inventories (e.g., Bernreuter Personality
Inventory, Bell Adjustment Inventory, Edwards Personal 
Preference Schedule) which seek to measure a person's status 







on such traits as dominance, sociability, introversion, etc., by 
means of self-descriptive responses to a series of questions; 
(2) rating scales which call for rating, by one's self or another, 
the extent to which a subject possesses certain traits; and (3) 
opinion or attitude inventories (e.g., Allport-Vernon-Lindzey
Study of Values, Minnesota Teacher Attitude Inventory).
Some writers also classify interest, problem, and belief
inventories as personality tests (e.g., Kuder Preference Record,
Mooney Problem Check List). See projective technique.

power test. A test intended to measure level of performance 
unaffected by speed of response; hence one in which there is 
either no time limit or a very generous one. Items are usually 
arranged in order of increasing difficulty. 

practice effect. The influence of previous experience with a test 
on a later administration of the same or a similar test; usually 
an increased familiarity with the directions, kinds of questions, 
etc. Practice effect is greatest when the interval between testings 
is short, when the content of the two tests is identical or very 
similar, and when the initial test-taking represents a relatively 
novel experience for the subjects. 

predictive validity. See validity (2).

product-moment coefficient (r). Also known as the Pearson r.
See COEFFICIENT OF CORRELATION.

profile. A graphic representation of the results on several tests, 
for either an individual or a group, when the results have been 
expressed in some uniform or comparable terms (standard 
scores, percentile ranks, grade equivalents, etc.). The profile 
method of presentation permits identification of areas of 
strength or weakness. 

prognosis (prognostic) test. A test used to predict future
success in a specific subject or field, as the Pimsleur Language
Aptitude Battery.

projective technique (projective method). A method of
personality study in which the subject responds as he chooses to a
series of ambiguous stimuli such as ink blots, pictures, unfin- 
ished sentences, etc. It is assumed that under this free-response 
condition the subject "projects" manifestations of personality 
characteristics and organization that can, by suitable methods, 
be scored and interpreted to yield a description of his basic 
personality structure. The Rorschach (ink blot) Technique, 
the Murray Thematic Apperception Test and the Machover 
Draw-a-Person Test are commonly used projective methods. 

quartile. One of three points that divide the cases in a
distribution into four equal groups. The lower quartile (Q1), or 25th
percentile, sets off the lowest fourth of the group; the middle
quartile (Q2) is the same as the 50th percentile, or median,
and divides the second fourth of cases from the third; and the
third quartile (Q3), or 75th percentile, sets off the top fourth.
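
As an illustration, the three quartile points can be computed with Python's standard library. This is a sketch only: `statistics.quantiles` uses its default "exclusive" method here, and other software may place Q1 and Q3 at slightly different values for small groups. The scores are hypothetical.

```python
from statistics import median, quantiles


def quartile_points(scores):
    """Return (Q1, Q2, Q3), the three points dividing scores into four groups."""
    q1, q2, q3 = quantiles(scores, n=4)
    return q1, q2, q3


# Hypothetical set of eight raw scores.
EXAMPLE_SCORES = [12, 15, 17, 20, 22, 25, 28, 31]
```

Whatever convention is used for Q1 and Q3, the middle quartile coincides with the median of the group.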

r. See coefficient of correlation.

random sample. A sample of the members of some total pop- 
ulation drawn in such a way that every member of the popu- 
lation has an equal chance of being included — that is, in a 
way that precludes the operation of bias or "selection." The
purpose in using a sample free of bias is, of course, the re- 
quirement that the cases used be representative of the total 



population if findings for the sample are to be generalized to 
that population. In a stratified random sample, the drawing of 
cases is controlled in such a way that those chosen are
"representative" also of specified subgroups of the total popula-
tion. See REPRESENTATIVE SAMPLE. 

range. For some specified group, the difference between the 
highest and the lowest obtained score on a test; thus a very 
rough measure of spread or variability, since it is based upon 
only two extreme scores. Range is also used in reference to 
the possible spread of measurement a test provides, which in 
most instances is the number of items in the test. 

raw score. The first quantitative result obtained in scoring a 
test. Usually the number of right answers, number right minus 
some fraction of number wrong, time required for performance,
number of errors, or similar direct, unconverted,
uninterpreted measure.

readiness test. A test that measures the extent to which an 
individual has achieved a degree of maturity or acquired certain
skills or information needed for successfully undertaking
some new learning activity. Thus a reading readiness test indi- 
cates whether a child has reached a developmental stage where 
he may profitably begin formal reading instruction. Readiness 
tests are classified as prognostic tests. 

recall item. A type of item that requires the examinee to sup- 
ply the correct answer from his own memory or recollection, 
as contrasted with a recognition item, in which he need only 
identify the correct answer. 

    Columbus discovered America in the year ______.

is a recall (or completion) item. See recognition item.

recognition item. An item which requires the examinee to rec- 
ognize or select the correct answer from among two or more 
given answers (options). 

Columbus discovered America in 
(a) 1425 (b) 1492 (c) 1520 (d) 1546 
is a recognition item. 

regression effect. Tendency of a predicted score to be nearer 
to the mean of its distribution than the score from which it is 
predicted is to its mean. Because of the effects of regression, 
students making extremely high or extremely low scores on a 
test tend to make less extreme scores, i.e., closer to the mean, 
on a second administration of the same test or on some pre- 
dicted measure. 
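
A small numerical sketch may make the effect concrete. It assumes the usual straight-line prediction, in which the predicted standing (in standard deviation units) equals the correlation r times the standing on the first measure; the correlation of .80 and the scores used below are hypothetical, and the function name is illustrative.

```python
def predicted_score(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict a score on Y from a score on X by linear regression.

    The predicted standing, in S.D. units, is r times the standing on X,
    so with |r| < 1 the prediction lies nearer to its mean than x does.
    """
    z_x = (x - mean_x) / sd_x
    return mean_y + r * z_x * sd_y


# Hypothetical retest on a scale with M = 100 and S.D. = 16, r = .80.
FIRST_SCORE = 132
PREDICTED_RETEST = predicted_score(FIRST_SCORE, 100, 16, 100, 16, 0.80)
```

Here a first score two S.D. above the mean leads to a predicted retest score only 1.6 S.D. above the mean, which is the regression effect described in the entry.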

reliability. The extent to which a test is consistent in measuring 
whatever it does measure; dependability, stability, trustworthi- 
ness, relative freedom from errors of measurement. Reliability 
is usually expressed by some form of reliability coefficient or 
by the standard error of measurement derived from it. 

reliability coefficient. The coefficient of correlation between 
two forms of a test, between scores on two administrations of 
the same test, or between halves of a test, properly corrected. 
The three measure somewhat different aspects of reliability, but 
all are properly spoken of as reliability coefficients. See
ALTERNATE-FORM RELIABILITY, SPLIT-HALF RELIABILITY COEFFICIENT,
TEST-RETEST RELIABILITY COEFFICIENT, KUDER-RICHARDSON
FORMULA(S).






representative sample. A sample that corresponds to or
matches the population of which it is a sample with respect to
characteristics important for the purposes under investigation.
In an achievement test norm sample, such significant aspects
might be the proportion of cases of each sex, from various
types of schools, different geographical areas, the several
socioeconomic levels, etc.

scholastic aptitude. See academic aptitude.

skewed distribution. A distribution that departs from symmetry
or balance around the mean, i.e., from normality. Scores
pile up at one end and trail off at the other.

Spearman-Brown formula. A formula giving the relationship
between the reliability of a test and its length. The formula
permits estimation of the reliability of a test lengthened or
shortened by any multiple, from the known reliability of a
given test. Its most common application is the estimation of
reliability of an entire test from the correlation between its
two halves. See split-half reliability coefficient.

split-half reliability coefficient. A coefficient of reliability
obtained by correlating scores on one half of a test with scores
on the other half, and applying the Spearman-Brown formula
to adjust for the doubled length of the total test. Generally,
but not necessarily, the two halves consist of the odd-numbered
and the even-numbered items. Split-half reliability coefficients
are sometimes referred to as measures of the internal
consistency of a test; they involve content sampling only, not stability
over time. This type of reliability coefficient is inappropriate
for tests in which speed is an important component.
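
The two steps (correlate the halves, then step up to full length) can be sketched as follows. The glossary does not print the Spearman-Brown formula itself; the sketch assumes its commonly cited form, r_new = n * r / (1 + (n - 1) * r), where n is the factor by which the test is lengthened. The function names and the odd/even split of 0/1 item responses are illustrative assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment coefficient of correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, length_factor=2.0):
    """Estimated reliability of a test lengthened by `length_factor`."""
    return length_factor * r / (1 + (length_factor - 1) * r)


def split_half_reliability(responses):
    """Odd-even split-half reliability; responses[e][i] is examinee e's item score."""
    odd_half = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))
```

For instance, a correlation of .60 between the two halves steps up to an estimated reliability of .75 for the full-length test.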

standard deviation (S.D.). A measure of the variability or
dispersion of a distribution of scores. The more the scores cluster
around the mean, the smaller the standard deviation. For a
normal distribution, approximately two thirds (68.3 percent)
of the scores are within the range from one S.D. below the
mean to one S.D. above the mean. Computation of the S.D.
is based upon the square of the deviation of each score from
the mean. The S.D. is sometimes called "sigma" and is
represented by the symbol σ. (See Figure 1.)
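
The computation can be sketched in a few lines. The example treats the scores as a complete group (dividing by N rather than N - 1), which is an assumption about the intended convention; the function name and the scores are hypothetical. The second part checks the "approximately two thirds" figure against the normal curve.

```python
from math import sqrt
from statistics import NormalDist


def standard_deviation(scores):
    """S.D. as the square root of the average squared deviation from the mean."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(s - mean) ** 2 for s in scores]
    return sqrt(sum(squared_deviations) / len(scores))


# Hypothetical raw scores with a mean of 5.
EXAMPLE_SCORES = [2, 4, 4, 4, 5, 5, 7, 9]

# Share of a normal distribution lying within one S.D. of the mean.
SHARE_WITHIN_ONE_SD = NormalDist().cdf(1) - NormalDist().cdf(-1)
```

The average of the squared deviations computed along the way is the variance of the group.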





[Figure 1. Normal curve, showing relations among standard deviation distance from mean, area
(percentage of cases) between these points, percentile rank, and IQ from tests with an S.D. of 16.]



standard error (S.E.). A statistic providing an estimate of the
possible magnitude of "error" present in some obtained measure,
whether (1) an individual score or (2) some group measure,
as a mean or a correlation coefficient.

(1) standard error of measurement (S.E. Meas.): As applied
to a single obtained score, the amount by which the score
may differ from the hypothetical true score due to errors of
measurement. The larger the S.E. Meas., the less reliable the
score. The S.E. Meas. is an amount such that in about
two-thirds of the cases the obtained score would not differ by more
than one S.E. Meas. from the true score. (Theoretically, then,
it can be said that the chances are 2:1 that the actual score is
within a band extending from true score minus 1 S.E. Meas. to
true score plus 1 S.E. Meas.; but since the true score can never
be known, actual practice must reverse the true-obtained
relation for an interpretation.) Other probabilities are noted
under (2) below. See true score.

(2) standard error: When applied to group averages,
standard deviations, correlation coefficients, etc., the S.E.
provides an estimate of the "error" which may be involved. The
group's size and the S.D. are the factors on which these
standard errors are based. The same probability interpretation
as for S.E. Meas. is made for the S.E.s of group measures, i.e.,
2:1 (2 out of 3) for the 1 S.E. range, 19:1 (95 out of 100)
for a 2 S.E. range, 99:1 (99 out of 100) for a 2.6 S.E. range.
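
A brief sketch of how such a band might be built around an obtained score follows. The glossary does not give a computing formula; the sketch assumes the relation commonly used in classical test theory, S.E. Meas. = S.D. * sqrt(1 - reliability). The function names, the reliability of .91, and the obtained score are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """S.E. Meas. from the score S.D. and a reliability coefficient (assumed formula)."""
    return sd * sqrt(1 - reliability)


def score_band(obtained_score, se_meas, n_se=1.0):
    """Band extending n_se standard errors below and above the obtained score."""
    return (obtained_score - n_se * se_meas, obtained_score + n_se * se_meas)


# Hypothetical test: S.D. = 16, reliability = .91, obtained score = 110.
SE_MEAS = standard_error_of_measurement(16, 0.91)
ONE_SE_BAND = score_band(110, SE_MEAS)           # about 2 chances in 3
TWO_SE_BAND = score_band(110, SE_MEAS, n_se=2)   # about 95 chances in 100
```

The wider the band chosen, the greater the stated chance that it covers the true score, at the cost of a less precise statement.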

standard score. A general term referring to any of a variety of
"transformed" scores, in terms of which raw scores may be
expressed for reasons of convenience, comparability, ease of
interpretation, etc. The simplest type of standard score, known
as a z-score, is an expression of the deviation of a score from
the mean score of the group in relation to the standard
deviation of the scores of the group. Thus:

$$\text{standard score } (z) = \frac{\text{raw score } (X) - \text{mean } (M)}{\text{standard deviation (S.D.)}}$$

Adjustments may be made in this ratio so that a system of 
standard scores having any desired mean and standard devia- 
tion may be set up. The use of such standard scores does not 
affect the relative standing of the individuals in the group or 
change the shape of the original distribution. T-scores have a 
M of 50 and an S.D. of 10. Deviation IQs are standard scores 
with a M of 100 and some chosen S.D., most often 16; thus 
a raw score that is 1 S.D. above the M of its distribution would 
convert to a standard score (deviation IQ) of 100 + 16 = 116. 
(See Figure 1.) 

Standard scores are useful in expressing the raw scores of 
two forms of a test in comparable terms in instances where 
tryouts have shown that the two forms are not identical in 
difficulty; also, successive levels of a test may be linked to 
form a continuous standard-score scale, making across-battery 
comparisons possible. 
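
The conversions described in this entry can be sketched as below. The z-score ratio, the T-score scale (M of 50, S.D. of 10), and the deviation IQ scale (M of 100, S.D. of 16) come from the entry; the function names and the example raw-score distribution are hypothetical.

```python
def z_score(raw, mean, sd):
    """Deviation of a raw score from the group mean, in S.D. units."""
    return (raw - mean) / sd


def rescale(z, new_mean, new_sd):
    """Express a z-score on a scale with any desired mean and S.D."""
    return new_mean + z * new_sd


def t_score(raw, mean, sd):
    """Standard score on a scale with M = 50 and S.D. = 10."""
    return rescale(z_score(raw, mean, sd), 50, 10)


def deviation_iq(raw, mean, sd, iq_sd=16):
    """Standard score on a scale with M = 100 and a chosen S.D. (16 by default)."""
    return rescale(z_score(raw, mean, sd), 100, iq_sd)
```

For a hypothetical raw-score distribution with a mean of 50 and an S.D. of 10, a raw score of 60 lies 1 S.D. above the mean, so `deviation_iq(60, 50, 10)` reproduces the 100 + 16 = 116 example in the text. Because every score is shifted and stretched in the same way, relative standing within the group is unchanged.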

standardized test (standard test). A test designed to provide a 
systematic sample of individual performance, administered ac- 
cording to prescribed directions, scored in conformance with 
definite rules, and interpreted in reference to certain norma- 
tive information. Some would further restrict the usage of the 
term "standardized" to those tests for which the items have 
been chosen on the basis of experimental evaluation, and for 
which data on reliability and validity are provided. Others 
would add "commercially published" and/or "for general use."






stanine. One of the steps in a nine-point scale of standard scores.
The stanine (short for standard-nine) scale has values from 1
to 9, with a mean of 5 and a standard deviation of 2. Each
stanine (except 1 and 9) is 1/2 S.D. in width, with the middle
(average) stanine of 5 extending from 1/4 S.D. below to 1/4
S.D. above the mean. (See Figure 2.)
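
Because each interior stanine is half an S.D. wide and stanine 5 is centered on the mean, a z-score can be converted to a stanine by simple arithmetic. The sketch below assumes normally distributed scores and assigns a z-score that falls exactly on a boundary to the higher stanine; the function names are illustrative.

```python
from math import floor
from statistics import NormalDist


def stanine_from_z(z):
    """Convert a z-score to a stanine (1 through 9)."""
    # Stanine boundaries fall at z = -1.75, -1.25, ..., +1.25, +1.75.
    return max(1, min(9, floor(2 * z + 5.5)))


def stanine_share(stanine):
    """Proportion of a normal distribution falling in the given stanine."""
    unit_normal = NormalDist()
    lower = 0.0 if stanine == 1 else unit_normal.cdf((stanine - 5.5) / 2)
    upper = 1.0 if stanine == 9 else unit_normal.cdf((stanine - 4.5) / 2)
    return upper - lower
```

Under the normal-curve assumption, about 20 percent of scores fall in stanine 5 and about 4 percent in each of stanines 1 and 9.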




Stanine    Percent of Scores    Approximate Range of Percentile Ranks

1           4%                  Below 5
2           7%                  5-11
3          12%                  12-23
4          17%                  24-40
5          20%                  41-60
6          17%                  61-77
7          12%                  78-89
8           7%                  90-96
9           4%                  Above 96

Figure 2. Stanines and the normal curve. Each stanine (except 1 and 9) is one half S.D. in width.



survey test. A test that measures general achievement in a 
given area, usually with the connotation that the test is in- 
tended to assess group status, rather than to yield precise 
measures of individual performance. 

t. A critical ratio expressing the relationship of some measure
(mean, correlation coefficient, difference, etc.) to its standard
error. The size of this ratio is an indication of the significance
of the measure. If t is as large as 1.96, significance at the .05
level is indicated; if as large as 2.58, at the .01 level. These
levels indicate 95 or 99 chances out of 100, respectively.
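
The rule of thumb in this entry can be written out as a short sketch. The cutoffs 1.96 and 2.58 are those given in the text; the function names and the example values are hypothetical, and the fixed cutoffs are appropriate only for reasonably large groups.

```python
def critical_ratio(measure, standard_error):
    """Ratio of a measure (mean, difference, etc.) to its standard error."""
    return measure / standard_error


def significance_level(t):
    """Return the level reached (".01" or ".05"), or None if neither is reached."""
    if abs(t) >= 2.58:
        return ".01"
    if abs(t) >= 1.96:
        return ".05"
    return None


# Hypothetical difference between two group means and its standard error.
EXAMPLE_T = critical_ratio(5.0, 2.0)
```

In this hypothetical case the ratio is 2.5, which reaches the .05 level but not the .01 level.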

taxonomy. An embodiment of the principles of classification; 
a survey, usually in outline form, such as a presentation of the 
objectives of education. 

test-retest reliability coefficient. A type of reliability coefficient 
obtained by administering the same test a second time, after 
a short interval, and correlating the two sets of scores. "Same
test" was originally understood to mean identical content, i.e.,
the same form; currently, however, the term "test-retest" is
also used to describe the administration of different forms of
the same test, in which case this reliability coefficient becomes
the same as the alternate-form coefficient. In either case (1)
fluctuations over time and in testing situation, and (2) any
effect of the first test upon the second are involved. When the
time interval between the two testings is considerable, as several
months, a test-retest reliability coefficient reflects not only
the consistency of measurement provided by the test, but also
the stability of the examinee trait being measured. 



true score. A score entirely free of error; hence, a hypothetical
value that can never be obtained by testing, which always
involves some measurement error. A "true" score may be thought
of as the average score from an infinite number of measurements
from the same or exactly equivalent tests, assuming
no practice effect or change in the examinee during the testings.
The standard deviation of this infinite number of "samplings"
is known as the standard error of measurement.

validity. The extent to which a test does the job for which it
is used. This definition is more satisfactory than the traditional
"extent to which a test measures what it is supposed to measure,"
since the validity of a test is always specific to the purposes
for which the test is used. The term validity, then, has
different connotations for various types of tests and, thus, a 
different kind of validity evidence is appropriate for each. 

(1) content, curricular validity. For achievement tests, 
validity is the extent to which the content of the test represents 
a balanced and adequate sampling of the outcomes (knowl- 
edge, skills, etc.) of the course or instructional program it is 
intended to cover. It is best evidenced by a comparison of the
test content with courses of study, instructional materials, and
statements of educational goals; and often by analysis of the
processes required in making correct responses to the items.
Face validity, referring to an observation of what a test appears
to measure, is a non-technical type of evidence; apparent
relevancy is, however, quite desirable. 

(2) criterion-related validity. The extent to which scores 
on the test are in agreement with (concurrent validity) or pre- 
dict (predictive validity) some given criterion measure. Pre- 
dictive validity refers to the accuracy with which an aptitude, 
prognostic, or readiness test indicates future learning success
in some area, as evidenced by correlations between scores on
the test and future criterion measures of such success (e.g., the
relation of score on an academic aptitude test administered in
high school to grade point average over four years of college).
In concurrent validity, no significant time interval elapses between
administration of the test being validated and of the
criterion measure. Such validity might be evidenced by concurrent
measures of academic ability and of achievement, by
the relation of a new test to one generally accepted as or known
to be valid, or by the correlation between scores on a test and 
criteria measures which are valid but are less objective and 
more time-consuming to obtain than a test score would be. 

(3) construct validity. The extent to which a test measures 
some relatively abstract psychological trait or construct; ap- 
plicable in evaluating the validity of tests that have been con- 
structed on the basis of an analysis (often factor analysis) of 
the nature of the trait and its manifestations. Tests of person- 
ality, verbal ability, mechanical aptitude, critical thinking, 
etc., are validated in terms of their construct and the relation
of their scores to pertinent external data. 

variability. The spread or dispersion of test scores, best indi- 
cated by their standard deviation. 

variance. For a distribution, the average of the squared devia- 
tions from the mean; thus the square of the standard deviation. 



TEST SERVICE NOTEBOOKS are issued from time to time as a professional service of The Psychological Corporation. Inquiries, 
comments, or requests for additional copies may be addressed to the office nearest you. Write: Advisory Services, The Psychological 
Corporation, New York, NY 10017 • Chicago, IL 60648 • San Francisco, CA 94109 • Atlanta, GA 30309 • Dallas, TX 75235




Appendix B 

Summary of Common Test Scores 






SCORES FREQUENTLY ASSOCIATED WITH NORM-REFERENCED TESTS



DEFINITION

The percentile rank establishes a student's standing relative to
a norm group in terms of the percentage of students who scored at
or below his or her raw score. For example, a student who scored
at the 98th percentile achieved a raw score which was higher than
the raw scores of 98 percent of the norm group who took the same
test under the same conditions.



MAJOR ADVANTAGES

1. Percentiles show the relative standing of individuals compared
   to a normative group.

2. They are familiar to most public school personnel, though
   probably not the general public.

3. Percentiles are relatively easily explained.



MAJOR DISADVANTAGES

1. Percentiles are frequently confused with the percent of the
   total number of test items answered correctly.

2. Since the percentile scale does not have equal units of
   measurement, percentiles should not be used in the
   computation of group statistics.



DEFINITION

The grade equivalent score indicates the performance of a student
on a particular test relative to the median performance of students
at a given grade level and month; e.g., a fifth grader who receives
a grade equivalent score of 8.2 on a reading test achieved the same
raw score performance as the typical eighth grader in the second
month of eighth grade would be expected to achieve on the same
fifth grade test.



MAJOR ADVANTAGES

1. It appears easy to communicate the standing of an individual
   student relative to a grade level (most people believe they
   understand what is meant by grade equivalent scores).



MAJOR DISADVANTAGES

1. Grade equivalents are easily misunderstood and misinterpreted.

2. Achievement expressed in grade equivalent score units cannot be
   meaningfully compared with each other in several instances.

   a. Grade equivalent scores cannot be meaningfully compared for
      the same student (or group of students) over time.

   b. Grade equivalent scores cannot be meaningfully compared for
      the same student (or group of students) across subject
      matter areas.

   c. Grade equivalent scores cannot be meaningfully compared for
      the same student (or group of students) across different
      tests.

3. Many grade equivalent scores are statistical projections
   (interpolations or extrapolations). In the later grades it is
   not uncommon to find grade equivalent scores of two or three
   grade levels above or below the student's actual grade level,
   but these scores are of doubtful accuracy.

4. The grade equivalent scale is not composed of equal sized units.
   Having equal sized units implies that the underlying difference
   between any two scores is the same throughout the scale.





SCORES FREQUENTLY ASSOCIATED WITH NORM REFERENCED TESTS






DEFINITION 



MAJOR ADVANTAGES



MAJOR DISADVANTAGES



Standard scores are derived from
raw scores, but express the results
of a test on the same numerical
scale regardless of grade level,
subject area or test employed.



1. Since the mean and standard deviation
of the standard score scales are
prespecified, a student's standard score
immediately communicates two important
facts about his or her performance on
that test:

a. Whether the student's score is
above or below the mean.

b. How far above or below the mean,
in standard deviation units, his
or her performance is.

2. The constant numerical scale of standard
scores facilitates comparisons:

a. Across students taking the same 
test. 

b. Across subject matter areas for the 
same student. 

3. Standard scores are derived in a way
that maintains the equal interval
property in their units, which is absent
in percentile and grade equivalent scores.
Therefore, summary statistics may be
meaningfully interpreted when calculated
on standard scores.



1. The most useful interpretation of standard
scores requires some knowledge of statistics
(i.e., mean and standard deviation) and
hence may not be appropriate for audiences
who have not been exposed to these concepts
(e.g., parents, the news media).

2. Given the variety of standard scores available,
there may be potential confusion in expressing
the same test performance with so many different
numerical values.



3. The conversion of raw scores to standard scores
may either maintain the shape of the distribution
observed, or may transform the distribution to
another, more interpretively convenient shape
(e.g., the normal distribution); and the
procedures employed in specifying the conversion
process may not be immediately obvious.
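
As an illustration of a conversion that maintains the shape of the observed distribution, the sketch below turns raw scores into z-scores (mean 0, standard deviation 1) and then into T-scores (mean 50, standard deviation 10). These two scale names are common examples not named in the table, and the raw scores are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # SD of the group being described
    return [(x - m) / sd for x in raw_scores]


def rescale(z, new_mean, new_sd):
    """Express a z-score on a scale with a prespecified mean and SD."""
    return new_mean + new_sd * z


# Hypothetical raw scores for eight students (mean 5, SD 2).
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(raw)
t = [rescale(v, 50, 10) for v in z]

print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(t)  # [35.0, 45.0, 45.0, 45.0, 50.0, 50.0, 60.0, 70.0]
```

Each value tells the reader both facts listed under the advantages: the sign of the z-score (or a T-score above or below 50) shows whether the student is above or below the mean, and the size shows how far in standard deviation units. The second disadvantage is also visible: the same performance appears as 2.0 on one scale and 70 on another.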



A standard score system having 99
equal intervals. The average corresponds
to the 50th centile; the 1st and
99th NCEs correspond to the 1st and 99th
centiles. Range: generally 1-99
but can be higher and lower.



1. Same as standard score systems.

2. Permit aggregation of data from a wide
variety of tests.



1. They are relatively new.

2. They depend upon standard scores or
percentiles.

3. Not all test publishers use them. 
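
The dependence of NCEs on percentiles can be sketched as follows. The conversion assumes the usual NCE scale with a mean of 50 and a standard deviation of about 21.06, the value that makes NCEs of 1, 50 and 99 line up with the same centiles; the constant is not stated in the table and is supplied here as an assumption.

```python
from statistics import NormalDist

NCE_MEAN = 50.0
NCE_SD = 21.06  # assumed scale factor; aligns NCE 1 and 99 with those centiles


def percentile_to_nce(percentile):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    z = NormalDist().inv_cdf(percentile / 100.0)  # normal deviate
    return NCE_MEAN + NCE_SD * z


for p in (1, 25, 50, 75, 99):
    print(p, round(percentile_to_nce(p), 1))
```

Only at 1, 50 and 99 do the two scales agree. Between those points the NCE values are spread in equal intervals, which is what permits them to be averaged when percentiles cannot.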






DEFINITION 



MAJOR ADVANTAGES



MAJOR DISADVANTAGES



Expanded scale scores are a type of
standard score whose scale is
designed to extend across grade
levels and whose mean increases
progressively as the grade level
increases.



1. Expanded scale scores facilitate longitudinal
comparisons of an individual across
grade levels.

2. Expanded scale scores provide the vehicle
for expressing a performance obtained at
one grade level relative to the norm group
of another. This is useful when the
appropriate level of a test to be administered
to a student is judged to be other than
that of his or her grade level (i.e.,
functional level testing).

3. Since they were designed as equal
interval, their scores may be mathematically
manipulated (e.g., averaged).



1. Different test publishers use different
terms to refer to their expanded scale
scores (e.g., growth scale values,
achievement development scale scores,
standard score, scale score) and this
may be confusing when considering results
from different tests.

2. Different tests use different ranges
and standard deviations in deriving
their expanded scale scores. Thus,
results from different tests expressed
in expanded scale score units cannot be
readily compared.

3. The statistical properties of expanded
scale scores are often not as uniform
as theoretically desired.



Stanines are a standard score scale
consisting of nine values with a
mean of five and a standard deviation
of two.

If the distribution of scores is
normal, each stanine includes a
known proportion of the scores
in the distribution.



1. As in all standard scores, stanines have 
the same meaning across different tests, 
different grade levels and different 
content areas. 

2. Stanines consist of only nine possible
scores and thus may be easier to communicate
to audiences not familiar with
measurement terminology. Verbal labels
may be given to each stanine value to
facilitate interpretation.



Since some of the stanines encompass 
a wide range of scores, their use in 
reporting can be insensitive to differ- 
ences between students' performance 
that are more apparent from the use of 
other test scores. 
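
The "known proportion" in each stanine can be worked out from the definition. The sketch below assumes the usual construction in which each stanine is a band half a standard deviation wide centered on stanine 5, with stanines 1 and 9 taking everything beyond; that construction is consistent with a mean of five and a standard deviation of two but is not spelled out in the table.

```python
import math
from statistics import NormalDist


def stanine_from_z(z):
    """Map a z-score to a stanine: half-SD-wide bands centered on 5."""
    return min(9, max(1, math.floor(2 * z + 5.5)))


def stanine_percentages():
    """Percent of a normal distribution falling in stanines 1 through 9."""
    cuts = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
    cdf = [0.0] + [NormalDist().cdf(c) for c in cuts] + [1.0]
    return [round(100 * (hi - lo)) for lo, hi in zip(cdf, cdf[1:])]


print(stanine_from_z(0.0))    # 5
print(stanine_from_z(1.0))    # 7
print(stanine_percentages())  # [4, 7, 12, 17, 20, 17, 12, 7, 4]
```

Because stanine 5 alone covers about a fifth of all students, two students with noticeably different raw scores can receive the same stanine, which is the insensitivity described in the disadvantage above.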




SCORES FREQUENTLY ASSOCIATED WITH OBJECTIVE REFERENCED TESTS



DEFINITION



MAJOR ADVANTAGES



MAJOR DISADVANTAGES 



The number of items on a test or
subtest answered correctly by the
student.



1. Virtually no statistical or measurement
expertise is needed to calculate
raw scores.

2. Raw scores are the necessary first step
in expressing test performance in any of
a number of other ways (e.g., standard
scores, percentiles).



By themselves, raw scores offer no indication
as to how a student who has mastered the skills
represented on the test "should" perform
(i.e., criterion referenced) or how other
students at the same grade level have performed
(i.e., norm referenced).



The proportion of the total number
of items answered correctly by the
student.



1. Very little statistical or measurement
expertise is required to understand this
expression of test performance.

2. If the content area is sufficiently
represented by the items on the test,
the percent correct provides an expression
of the proportion of the subject matter
mastered by the student.



No notion of test difficulty or expected
performance is contained in this score.
Unless accompanied by a standard for mastery
or information as to how a student's peers have
performed on the test, misinterpretations may
arise.



When a standard for mastery has been
applied to a set of items for a specific
objective, a student's performance
in terms of that objective is expressed
as having mastery or non-mastery of
the objective.



1. The objective mastery score compares the
student's performance on that objective
to a judged standard of what he or she
should know of the skills required to master
it. This score can be very useful in
diagnosing a student's specific strengths
and weaknesses.



2. When the subject matter requires a
successive accumulation of skills (e.g.,
elementary math), objective mastery
scores may be extremely useful in
monitoring the progress of students in
specific skill areas.



Objective mastery scores are difficult to 
compare across different tests. Items designed 
to measure the same objective may differ in 
difficulty or have different standards for 
mastery on different tests. 



If a purpose in testing is to differentiate
among students, objective mastery scores do
not present a very useful index. Different
raw scores above or below the mastery level
are viewed as the same, either mastery or
non-mastery.
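
The three objective referenced scores described above can be derived from the same item results, as the following sketch suggests. The objectives, the item results and the 80 percent standard for mastery are all hypothetical; in practice the standard is a matter of judgment set for each test.

```python
# Hypothetical item results (True = answered correctly), by objective.
responses = {
    "addition": [True, True, True, True, False],
    "subtraction": [True, False, True, False, False],
    "place value": [True, True, True, True, True],
}
MASTERY_CUTOFF_PERCENT = 80  # hypothetical judged standard for mastery


def raw_score(items):
    """Number of items answered correctly."""
    return sum(items)


def percent_correct(items):
    """Proportion of items answered correctly, as a percentage."""
    return 100.0 * raw_score(items) / len(items)


def has_mastery(items, cutoff_percent=MASTERY_CUTOFF_PERCENT):
    """True when the raw score meets the standard for the objective."""
    return 100 * raw_score(items) >= cutoff_percent * len(items)


for objective, items in responses.items():
    print(objective, raw_score(items), percent_correct(items),
          "mastery" if has_mastery(items) else "non-mastery")
```

Here "addition" (4 of 5 correct) and "place value" (5 of 5 correct) are both reported simply as mastery, which shows how the mastery score hides differences among raw scores on the same side of the standard.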



